Improve download link handling
The previous method relied on the main "download link" from the list page, but that link was broken roughly 1/4 of the time, and far more often for some artists. Instead, during database build, fetch and parse each actual song page as well, and collect from it all possible download links. A ThreadPoolExecutor keeps this reasonably fast (default of 10 workers, user configurable). Then, when downloading, iterate over all download links, or let the user filter them by ID or description.
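A minimal sketch of the fan-out approach described above, assuming a stand-in `fetch_song_data` in place of the real page scraper (names here are illustrative, not the tool's actual code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_song_data(page_id):
    # Stand-in for fetching/parsing one song page; returns
    # (log messages, parsed entry) like the tool's fetchSongData.
    return ([f"parsed page {page_id}"], {"id": page_id})

def build_database(page_ids, concurrency=10):
    found = []
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        # Submit one task per page, then collect results as they finish.
        futures = {executor.submit(fetch_song_data, p): p for p in page_ids}
        for future in as_completed(futures):
            try:
                messages, song = future.result()
            except Exception:
                continue  # a failed page is skipped, not fatal
            if song is not None:
                found.append(song)
    return found

songs = build_database(range(5), concurrency=3)
```

Because `as_completed` yields futures in completion order, results arrive unordered; the tool tolerates this since each entry is self-contained.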
This commit is contained in: parent 3a0ef3dcc6, commit 6ec8923336.
README.md: 13 lines changed
````diff
@@ -13,11 +13,6 @@ standardized format.
 To use the tool, first use the "database" command to build or modify your local JSON database, then use the
 "download" command to download songs.
 
-To avoid overloading or abusing the C3DB website, this tool operates exclusively in sequential mode by design; at
-most one page is scraped (for "database build") or song downloaded (for "download") at once. Additionally, the tool
-design ensures that the JSON database of songs is stored locally, so it only needs to be built once and then is
-reused to perform actual downloads without putting further load on the website.
-
 ## Installation
 
 1. Install the Python3 requirements from `requirements.txt`.
@@ -39,8 +34,9 @@ fetch all avilable songs for all games, and either specify it with the `-u`/`--b
 environment variable `C3DBDL_BASE_URL`.
 
 1. Initialize your C3DB JSON database with `c3dbdl [options] database build`. This will take a fair amount
-of time to complete as all pages of the chosen base URL are scanned. Note that if you cancel this process, no
-data will be saved, so let it complete!
+of time to complete as all pages of the chosen base URL, and all song pages (30,000+) are scanned. Note that if
+you cancel this process, no data will be saved, so let it complete! The default concurrency setting should make
+this relatively quick but YMMV.
 
 1. Download any song(s) you want with `c3dbdl [options] download [options]`.
@@ -86,6 +82,9 @@ Downloading song "Rush - Sweet Miracle" by ejthedj...
 Downloading from https://dl.c3universe.com/s/ejthedj/sweetMiracle...
 ```
 
+In addition to the above filters, within each song may be more than one download link. To filter these links,
+use the "-i"/"--download-id" and "-d"/"--download-descr" (see the help for details).
+
 Feel free to experiment.
 
 ## Output Format
````
c3dbdl: 288 lines changed
```diff
@@ -10,11 +10,87 @@ from difflib import unified_diff
 from colorama import Fore
 from bs4 import BeautifulSoup
 from urllib.error import HTTPError
+from concurrent.futures import ThreadPoolExecutor, as_completed
 
 CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help'], max_content_width=120)
 
+
+def fetchSongData(entry):
+    song_entry = dict()
+    messages = list()
+
+    for idx, td in enumerate(entry.find_all('td')):
+        if idx == 2:
+            # Artist
+            song_entry["artist"] = td.find('a').get_text().strip().replace('/', '+')
+        elif idx == 3:
+            # Song
+            song_entry["title"] = td.find('div', attrs={'class':'c3ttitlemargin'}).get_text().strip().replace('/', '+')
+            song_entry["album"] = td.find('div', attrs={'class':'c3tartist'}).get_text().strip().replace('/', '+')
+            # Song page
+            tmp_links = td.find_all('a', href=True)
+            for link in tmp_links:
+                if link.get('href'):
+                    song_entry["song_link"] = link.get('href')
+                    break
+        elif idx == 4:
+            # Genre
+            song_entry["genre"] = td.find('a').get_text().strip()
+        elif idx == 5:
+            # Year
+            song_entry["year"] = td.find('a').get_text().strip()
+        elif idx == 6:
+            # Length
+            song_entry["length"] = td.find('a').get_text().strip()
+        elif idx == 8:
+            # Author (of chart)
+            song_entry["author"] = td.find('a').get_text().strip().replace('/', '+')
+
+    if song_entry and song_entry['author'] and song_entry['title'] and song_entry["song_link"]:
+        messages.append(f"> Found song entry for {song_entry['artist']} - {song_entry['title']} by {song_entry['author']}")
+        for entry_type in ["artist", "album", "genre", "year", "length"]:
+            if not song_entry[entry_type]:
+                song_entry[entry_type] = "None"
+
+    # Get download links from the actual song page
+    attempts = 1
+    sp = None
+    while attempts <= 3:
+        try:
+            messages.append(f"Parsing song page {song_entry['song_link']} (attempt {attempts}/3)...")
+            sp = requests.get(song_entry["song_link"])
+            break
+        except Exception:
+            sleep(attempts)
+            attempts += 1
+    if sp is None or sp.status_code != 200:
+        messages.append("Failed to fetch song page, aborting")
+        return None
+
+    song_parsed_html = BeautifulSoup(sp.text, 'html.parser')
+
+    download_section = song_parsed_html.find('div', attrs={"class": "portlet light bg-inverse"})
+    download_links = download_section.find_all('a', href=True)
+    dl_links = list()
+    for link_entry in download_links:
+        link = link_entry.get('href')
+        description = link_entry.get_text().strip()
+        if not "c3universe.com" in link:
+            continue
+        messages.append(f"Found download link: {link} ({description})")
+        dl_links.append({
+            "link": link,
+            "description": description,
+        })
+    if not dl_links:
+        messages.append("Found no c3universe.com download links for song, not adding to database")
+        return None
+    song_entry["dl_links"] = dl_links
+
+    # Append to the database
+    return messages, song_entry
+
+
-def buildDatabase(pages=None):
+def buildDatabase(pages, concurrency):
     found_songs = []
 
     if pages is None:
```
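`fetchSongData` above retries its page fetch up to three times, sleeping `attempts` seconds between tries (a linear backoff: 1s, then 2s). The pattern in isolation, with an illustrative helper name:

```python
from time import sleep

def fetch_with_retries(fetch, url, max_attempts=3):
    """Call fetch(url) up to max_attempts times, sleeping longer after each failure."""
    attempts = 1
    while attempts <= max_attempts:
        try:
            return fetch(url)
        except Exception:
            sleep(attempts)  # linear backoff: 1s, then 2s, ...
            attempts += 1
    return None  # all attempts failed
```

Note that the committed code only retries when `requests.get` raises; a non-200 response breaks out of the loop immediately and is rejected by the status check afterwards.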
```diff
@@ -46,106 +122,109 @@ def buildDatabase(pages=None):
 
     table_html = parsed_html.body.find('div', attrs={'class':'portlet-body'}).find('tbody')
 
+    entries = list()
     for entry in table_html.find_all('tr', attrs={'class':'odd'}):
         if len(entry) < 1:
             break
+        entries.append(entry)
 
-        song_entry = dict()
-
-        for idx, td in enumerate(entry.find_all('td')):
-            if idx == 1:
-                # Download link
-                song_entry["dl_link"] = td.find('a', attrs={'target':'_blank'}).get('href')
-            elif idx == 2:
-                # Artist
-                song_entry["artist"] = td.find('a').get_text().strip().replace('/', '+')
-            elif idx == 3:
-                # Song
-                song_entry["title"] = td.find('div', attrs={'class':'c3ttitlemargin'}).get_text().strip().replace('/', '+')
-                song_entry["album"] = td.find('div', attrs={'class':'c3tartist'}).get_text().strip().replace('/', '+')
-            elif idx == 4:
-                # Genre
-                song_entry["genre"] = td.find('a').get_text().strip()
-            elif idx == 5:
-                # Year
-                song_entry["year"] = td.find('a').get_text().strip()
-            elif idx == 6:
-                # Length
-                song_entry["length"] = td.find('a').get_text().strip()
-            elif idx == 8:
-                # Author (of chart)
-                song_entry["author"] = td.find('a').get_text().strip().replace('/', '+')
-
-        if song_entry and song_entry['author'] and song_entry['title']:
-            click.echo(f"Found song entry for {song_entry['artist']} - {song_entry['title']} by {song_entry['author']}")
-            for entry_type in ["artist", "album", "genre", "year", "length"]:
-                if not song_entry[entry_type]:
-                    song_entry[entry_type] = "None"
-            found_songs.append(song_entry)
+    click.echo("Fetching and parsing song pages...")
+    with ThreadPoolExecutor(max_workers=concurrency) as executor:
+        future_to_song = {executor.submit(fetchSongData, entry): entry for entry in entries}
+        for future in as_completed(future_to_song):
+            try:
+                messages, song = future.result()
+                click.echo('\n'.join(messages))
+                if song is None:
+                    continue
+                found_songs.append(song)
+            except Exception:
+                continue
 
     return found_songs
 
 
-def downloadSong(destination, filename, entry):
-    click.echo(f"""Downloading song "{entry['artist']} - {entry['title']}" by {entry['author']}...""")
+def downloadSong(destination, filename, entry, dlid, dldesc):
+    click.echo(f"""> Downloading song "{entry['artist']} - {entry['title']}" by {entry['author']}...""")
 
-    try:
-        p = requests.get(entry['dl_link'])
-        if p.status_code != 200:
-            raise HTTPError(entry['dl_link'], p.status_code, "", None, None)
-
-        parsed_html = BeautifulSoup(p.text, 'html.parser')
-        download_url = parsed_html.body.find('div', attrs={'class':'lock-head'}).find('a').get('href')
-    except Exception as e:
-        click.echo(f"Failed parsing or retrieving HTML link: {e}")
-        return None
-
-    download_filename = filename.format(
-        genre=entry['genre'],
-        artist=entry['artist'],
-        album=entry['album'],
-        title=entry['title'],
-        year=entry['year'],
-        author=entry['author'],
-        orig_name=download_url.split('/')[-1],
-    )
-    download_filename = f"{destination}/{download_filename}"
-    download_path = '/'.join(f"{download_filename}".split('/')[0:-1])
-
-    if os.path.exists(download_filename):
-        click.echo(f"File exists at {download_filename}")
-        return None
-
-    click.echo(f"""Downloading from {download_url}...""")
-    attempts = 1
-    p = None
-    try:
-        with requests.get(download_url, stream=True) as r:
-            while attempts <= 3:
-                try:
-                    r.raise_for_status()
-                    break
-                except Exception:
-                    click.echo(f"Download attempt failed: HTTP {r.status_code}; retrying {attempts}/3")
-                    sleep(attempts)
-                    attempts += 1
-            if r is None or r.status_code != 200:
-                if r:
-                    code = r.status_code
-                else:
-                    code = "-1"
-                raise HTTPError(download_url, code, "", None, None)
-
-            if not os.path.exists(download_path):
-                os.makedirs(download_path)
-
-            with open(download_filename, 'wb') as f:
-                for chunk in r.iter_content(chunk_size=8192):
-                    f.write(chunk)
-            click.echo(f"Successfully downloaded to {download_filename}")
-    except Exception as e:
-        click.echo(f"Download attempt failed: {e}")
-        return None
+    if dlid is None:
+        dl_links = entry['dl_links']
+    else:
+        try:
+            dl_links = [entry['dl_links'][dlid - 1]]
+        except Exception:
+            click.echo(f"Invalid download link ID {dlid}.")
+            return
+
+    if dldesc is not None:
+        new_dl_links = list()
+        for entry in dl_links:
+            if dldesc in entry['description']:
+                new_dl_links.append(entry)
+        dl_links = new_dl_links
+
+    if not dl_links:
+        click.echo(f'No download link matching description "{dldesc}" found.')
+        return
+
+    for dl_link in dl_links:
+        try:
+            p = requests.get(dl_link['link'])
+            if p.status_code != 200:
+                raise HTTPError(dl_link['link'], p.status_code, "", None, None)
+
+            parsed_html = BeautifulSoup(p.text, 'html.parser')
+            download_url = parsed_html.body.find('div', attrs={'class':'lock-head'}).find('a').get('href')
+        except Exception as e:
+            click.echo(f"Failed parsing or retrieving HTML link: {e}")
+            continue
+
+        download_filename = filename.format(
+            genre=entry['genre'],
+            artist=entry['artist'],
+            album=entry['album'],
+            title=entry['title'],
+            year=entry['year'],
+            author=entry['author'],
+            orig_name=download_url.split('/')[-1],
+        )
+        download_filename = f"{destination}/{download_filename}"
+        download_path = '/'.join(f"{download_filename}".split('/')[0:-1])
+
+        click.echo(f"""Downloading file "{dl_link['description']}" from {download_url}...""")
+        if os.path.exists(download_filename):
+            click.echo(f"File exists at {download_filename}")
+            continue
+
+        attempts = 1
+        p = None
+        try:
+            with requests.get(download_url, stream=True) as r:
+                while attempts <= 3:
+                    try:
+                        r.raise_for_status()
+                        break
+                    except Exception:
+                        click.echo(f"Download attempt failed: HTTP {r.status_code}; retrying {attempts}/3")
+                        sleep(attempts)
+                        attempts += 1
+                if r is None or r.status_code != 200:
+                    if r:
+                        code = r.status_code
+                    else:
+                        code = "-1"
+                    raise HTTPError(download_url, code, "", None, None)
+
+                if not os.path.exists(download_path):
+                    os.makedirs(download_path)
+
+                with open(download_filename, 'wb') as f:
+                    for chunk in r.iter_content(chunk_size=8192):
+                        f.write(chunk)
+                click.echo(f"Successfully downloaded to {download_filename}")
+        except Exception as e:
+            click.echo(f"Download attempt failed: {e}")
+            continue
```
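The link-selection logic that the reworked `downloadSong` applies (pick by 1-based ID first, then narrow by description substring) can be isolated as a pure helper; this is a hypothetical sketch for illustration, not the tool's own function:

```python
def select_links(dl_links, dlid=None, dldesc=None):
    """Pick download links by 1-based ID and/or description substring."""
    if dlid is not None:
        try:
            dl_links = [dl_links[dlid - 1]]  # IDs are 1-based, as in the CLI help
        except IndexError:
            return []  # out-of-range ID selects nothing
    if dldesc is not None:
        # Fuzzy match: keep entries whose description contains the string.
        dl_links = [e for e in dl_links if dldesc in e["description"]]
    return dl_links
```

With neither option set, all of a song's links are returned, which matches the commit's default of downloading every available file.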
```diff
@@ -158,7 +237,11 @@ def downloadSong(destination, filename, entry):
     "-p", "--pages", "_pages", type=int, default=None, envvar='C3DBDL_BUILD_PAGES',
     help="Number of pages to scan (default is all)."
 )
-def build_database(_overwrite, _pages):
+@click.option(
+    "-c", "--concurrency", "_concurrency", type=int, default=10, envvar='C3DBDL_BUILD_CONCURRENCY',
+    help="Number of concurrent song page downloads to perform at once."
+)
+def build_database(_overwrite, _pages, _concurrency):
     """
     Initialize the local JSON database of C3DB songs from the website.
```
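The new `-c`/`--concurrency` option (default 10, overridable via `C3DBDL_BUILD_CONCURRENCY`) behaves like the following stdlib-argparse sketch; the real tool uses click, which reads the `envvar` natively, so treating the environment value as the default here is an approximation:

```python
import argparse
import os

def parse_args(argv):
    parser = argparse.ArgumentParser()
    # click reads envvar C3DBDL_BUILD_CONCURRENCY natively; with argparse we
    # emulate that by making the environment value the fallback default.
    parser.add_argument(
        "-c", "--concurrency", type=int,
        default=int(os.environ.get("C3DBDL_BUILD_CONCURRENCY", 10)),
        help="Number of concurrent song page downloads to perform at once.",
    )
    return parser.parse_args(argv)
```

So `c3dbdl database build -c 4` would run four page fetches at a time, and an unset flag falls back to 10.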
```diff
@@ -173,7 +256,7 @@ def build_database(_overwrite, _pages):
         exit(1)
 
     click.echo("Building JSON database; this will take a long time...")
-    songs_database = buildDatabase(_pages)
+    songs_database = buildDatabase(_pages, _concurrency)
     click.echo('')
     click.echo(f"Found {len(songs_database)} songs, dumping to database file '{config['database_filename']}'")
     if not os.path.exists(config['download_directory']):
```
```diff
@@ -267,7 +350,17 @@ def database():
     default=None, type=int,
     help='Limit to this many songs (first N matches).'
 )
-def download(_filters, _limit, _file_structure):
+@click.option(
+    "-i", "--download-id", "_id",
+    default=None, type=int,
+    help='Download only "dl_links" entry N (1 is first, etc.), or all if unspecified.'
+)
+@click.option(
+    "-d", "--download-descr", "_desc",
+    default=None,
+    help='Download only "dl_links" entries with this in their description (fuzzy).'
+)
+def download(_filters, _id, _desc, _limit, _file_structure):
     """
     Download song(s) from the C3DB webpage.
```
```diff
@@ -286,14 +379,12 @@ def download(_filters, _limit, _file_structure):
     The default output file structure is:
     "{genre}/{author}/{artist}/{album}/{title} [{year}].{orig_name}"
 
-    \b
     Filters allow granular selection of the song(s) to download. Multiple filters can be
     specified, and a song is selected only if ALL filters match (logical AND). Each filter
-    is in the form:
-    --filter [database_key] [value]
+    is in the form "--filter [database_key] [value]".
 
-    \b
-    The valid "database_key" values are identical to the output file fields above.
+    The valid "database_key" values are identical to the output file fields above, except
+    for "orig_name".
 
     \b
     For example, to download all songs in the genre "Rock":
```
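The output file structure in the help text above is applied with `str.format`; a quick illustration using a made-up database entry:

```python
template = "{genre}/{author}/{artist}/{album}/{title} [{year}].{orig_name}"

entry = {  # made-up example entry; fields mirror the database keys
    "genre": "Rock", "author": "MyName", "artist": "Rush",
    "album": "Vapor Trails", "title": "Sweet Miracle", "year": "2002",
}
# orig_name comes from the resolved download URL's final path component.
filename = template.format(orig_name="sweetMiracle.zip", **entry)
```

Since each placeholder is a plain database key, users can reorder or drop fields with `--file-structure` without touching the code.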
```diff
@@ -303,6 +394,13 @@ def download(_filters, _limit, _file_structure):
     Or to download all songs by the artist "Rush" and the author "MyName":
     --filter artist Rush --filter author MyName
 
+    In addition to filters, each song may have more than one download link, to provide
+    multiple versions of the same song (for example, normal and multitracks, or alternate
+    charts). For each song, the "-i"/"--download-id" and "-d"/"--download-descr" options
+    can help filter these out, or both can be left blank to download all possible files
+    for a given song. Mostly useful when being extremely restrictive with filters, less
+    so when downloading many songs at once.
+
     \b
     The following environment variables can be used for scripting purposes:
     * C3DBDL_DL_FILE_STRUCTURE: equivalent to "--file-structure"
```
```diff
@@ -331,7 +429,7 @@ def download(_filters, _limit, _file_structure):
     click.echo(f"Downloading {len(pending_songs)} song files...")
 
     for song in pending_songs:
-        downloadSong(config['download_directory'], _file_structure, song)
+        downloadSong(config['download_directory'], _file_structure, song, _id, _desc)
 
 
 @click.group(context_settings=CONTEXT_SETTINGS)
```