Improve download link handling

The previous method relied on the main "download link" on the list
page, but that link was broken roughly a quarter of the time, and far
more often for some artists.

Instead, during the DB build, fetch and parse each actual song page as
well, and collect from it every available download link. Use a
ThreadPoolExecutor to do this in a reasonable amount of time (default
of 10 workers, user configurable).

Then, when downloading, iterate over all download links, or let the
user filter them by ID or description.
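The concurrency approach described above follows the standard `ThreadPoolExecutor`/`as_completed` pattern from Python's `concurrent.futures`. A minimal sketch of that pattern (`fetch` is a trivial stand-in for the real per-song page scrape; all names here are illustrative, not from the patch):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(item):
    # Stand-in for fetching and parsing one song page.
    return item * 2

def fetch_all(items, workers=10):
    results = []
    # Submit every item up front, then collect results as each
    # future finishes; a failure in one worker is skipped rather
    # than aborting the whole run.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(fetch, item): item for item in items}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                continue
    return results

print(sorted(fetch_all([1, 2, 3])))  # → [2, 4, 6]
```

Results arrive in completion order, not submission order, which is why the commit collects per-song messages into a list and echoes them only once each future completes.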
Joshua Boniface 2023-04-06 02:06:55 -04:00
parent 3a0ef3dcc6
commit 6ec8923336
2 changed files with 203 additions and 106 deletions


@@ -13,11 +13,6 @@ standardized format.
 To use the tool, first use the "database" command to build or modify your local JSON database, then use the
 "download" command to download songs.
 
-To avoid overloading or abusing the C3DB website, this tool operates exclusively in sequential mode by design; at
-most one page is scraped (for "database build") or song downloaded (for "download") at once. Additionally, the tool
-design ensures that the JSON database of songs is stored locally, so it only needs to be built once and then is
-reused to perform actual downloads without putting further load on the website.
-
 ## Installation
 
 1. Install the Python3 requirements from `requirements.txt`.
@@ -39,8 +34,9 @@ fetch all avilable songs for all games, and either specify it with the `-u`/`--b
 environment variable `C3DBDL_BASE_URL`.
 
-1. Initialize your C3DB JSON database with `c3dbdl [options] database build`. This will take a fair amount
-of time to complete as all pages of the chosen base URL are scanned. Note that if you cancel this process, no
-data will be saved, so let it complete!
+1. Initialize your C3DB JSON database with `c3dbdl [options] database build`. This will take a fair amount
+of time to complete as all pages of the chosen base URL, and all song pages (30,000+), are scanned. Note that if
+you cancel this process, no data will be saved, so let it complete! The default concurrency setting should make
+this relatively quick but YMMV.
 
 1. Download any song(s) you want with `c3dbdl [options] download [options]`.
@@ -86,6 +82,9 @@ Downloading song "Rush - Sweet Miracle" by ejthedj...
 Downloading from https://dl.c3universe.com/s/ejthedj/sweetMiracle...
 ```
 
+In addition to the above filters, within each song there may be more than one download link. To filter these
+links, use the "-i"/"--download-id" and "-d"/"--download-descr" options (see the help for details).
+
 Feel free to experiment.
 
 ## Output Format

c3dbdl

@@ -10,11 +10,87 @@ from difflib import unified_diff
 from colorama import Fore
 from bs4 import BeautifulSoup
 from urllib.error import HTTPError
+from concurrent.futures import ThreadPoolExecutor, as_completed
 
 CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help'], max_content_width=120)
 
+
+def fetchSongData(entry):
+    song_entry = dict()
+    messages = list()
+
+    for idx, td in enumerate(entry.find_all('td')):
+        if idx == 2:
+            # Artist
+            song_entry["artist"] = td.find('a').get_text().strip().replace('/', '+')
+        elif idx == 3:
+            # Song
+            song_entry["title"] = td.find('div', attrs={'class':'c3ttitlemargin'}).get_text().strip().replace('/', '+')
+            song_entry["album"] = td.find('div', attrs={'class':'c3tartist'}).get_text().strip().replace('/', '+')
+            # Song page
+            tmp_links = td.find_all('a', href=True)
+            for link in tmp_links:
+                if link.get('href'):
+                    song_entry["song_link"] = link.get('href')
+                    break
+        elif idx == 4:
+            # Genre
+            song_entry["genre"] = td.find('a').get_text().strip()
+        elif idx == 5:
+            # Year
+            song_entry["year"] = td.find('a').get_text().strip()
+        elif idx == 6:
+            # Length
+            song_entry["length"] = td.find('a').get_text().strip()
+        elif idx == 8:
+            # Author (of chart)
+            song_entry["author"] = td.find('a').get_text().strip().replace('/', '+')
+
+    if song_entry and song_entry['author'] and song_entry['title'] and song_entry["song_link"]:
+        messages.append(f"> Found song entry for {song_entry['artist']} - {song_entry['title']} by {song_entry['author']}")
+        for entry_type in ["artist", "album", "genre", "year", "length"]:
+            if not song_entry[entry_type]:
+                song_entry[entry_type] = "None"
+
+        # Get download links from the actual song page
+        attempts = 1
+        sp = None
+        while attempts <= 3:
+            try:
+                messages.append(f"Parsing song page {song_entry['song_link']} (attempt {attempts}/3)...")
+                sp = requests.get(song_entry["song_link"])
+                break
+            except Exception:
+                sleep(attempts)
+                attempts += 1
+        if sp is None or sp.status_code != 200:
+            messages.append("Failed to fetch song page, aborting")
+            return None
+
+        song_parsed_html = BeautifulSoup(sp.text, 'html.parser')
+        download_section = song_parsed_html.find('div', attrs={"class": "portlet light bg-inverse"})
+        download_links = download_section.find_all('a', href=True)
+        dl_links = list()
+        for link_entry in download_links:
+            link = link_entry.get('href')
+            description = link_entry.get_text().strip()
+            if not "c3universe.com" in link:
+                continue
+            messages.append(f"Found download link: {link} ({description})")
+            dl_links.append({
+                "link": link,
+                "description": description,
+            })
+        if not dl_links:
+            messages.append("Found no c3universe.com download links for song, not adding to database")
+            return None
+        song_entry["dl_links"] = dl_links
+
+        # Append to the database
+        return messages, song_entry
+
+
-def buildDatabase(pages=None):
+def buildDatabase(pages, concurrency):
     found_songs = []
 
     if pages is None:
@@ -46,106 +122,109 @@ def buildDatabase(pages=None):
     table_html = parsed_html.body.find('div', attrs={'class':'portlet-body'}).find('tbody')
 
+    entries = list()
     for entry in table_html.find_all('tr', attrs={'class':'odd'}):
         if len(entry) < 1:
             break
+        entries.append(entry)
-        song_entry = dict()
-
-        for idx, td in enumerate(entry.find_all('td')):
-            if idx == 1:
-                # Download link
-                song_entry["dl_link"] = td.find('a', attrs={'target':'_blank'}).get('href')
-            elif idx == 2:
-                # Artist
-                song_entry["artist"] = td.find('a').get_text().strip().replace('/', '+')
-            elif idx == 3:
-                # Song
-                song_entry["title"] = td.find('div', attrs={'class':'c3ttitlemargin'}).get_text().strip().replace('/', '+')
-                song_entry["album"] = td.find('div', attrs={'class':'c3tartist'}).get_text().strip().replace('/', '+')
-            elif idx == 4:
-                # Genre
-                song_entry["genre"] = td.find('a').get_text().strip()
-            elif idx == 5:
-                # Year
-                song_entry["year"] = td.find('a').get_text().strip()
-            elif idx == 6:
-                # Length
-                song_entry["length"] = td.find('a').get_text().strip()
-            elif idx == 8:
-                # Author (of chart)
-                song_entry["author"] = td.find('a').get_text().strip().replace('/', '+')
-
-        if song_entry and song_entry['author'] and song_entry['title']:
-            click.echo(f"Found song entry for {song_entry['artist']} - {song_entry['title']} by {song_entry['author']}")
-            for entry_type in ["artist", "album", "genre", "year", "length"]:
-                if not song_entry[entry_type]:
-                    song_entry[entry_type] = "None"
-            found_songs.append(song_entry)
 
+    click.echo("Fetching and parsing song pages...")
+    with ThreadPoolExecutor(max_workers=concurrency) as executor:
+        future_to_song = {executor.submit(fetchSongData, entry): entry for entry in entries}
+        for future in as_completed(future_to_song):
+            try:
+                messages, song = future.result()
+                click.echo('\n'.join(messages))
+                if song is None:
+                    continue
+                found_songs.append(song)
+            except Exception:
+                continue
 
     return found_songs
 
 
-def downloadSong(destination, filename, entry):
-    click.echo(f"""Downloading song "{entry['artist']} - {entry['title']}" by {entry['author']}...""")
-    try:
-        p = requests.get(entry['dl_link'])
-        if p.status_code != 200:
-            raise HTTPError(entry['dl_link'], p.status_code, "", None, None)
-
-        parsed_html = BeautifulSoup(p.text, 'html.parser')
-        download_url = parsed_html.body.find('div', attrs={'class':'lock-head'}).find('a').get('href')
-    except Exception as e:
-        click.echo(f"Failed parsing or retrieving HTML link: {e}")
-        return None
-
-    download_filename = filename.format(
-        genre=entry['genre'],
-        artist=entry['artist'],
-        album=entry['album'],
-        title=entry['title'],
-        year=entry['year'],
-        author=entry['author'],
-        orig_name=download_url.split('/')[-1],
-    )
-    download_filename = f"{destination}/{download_filename}"
-    download_path = '/'.join(f"{download_filename}".split('/')[0:-1])
-
-    if os.path.exists(download_filename):
-        click.echo(f"File exists at {download_filename}")
-        return None
-
-    click.echo(f"""Downloading from {download_url}...""")
-    attempts = 1
-    p = None
-    try:
-        with requests.get(download_url, stream=True) as r:
-            while attempts <= 3:
-                try:
-                    r.raise_for_status()
-                    break
-                except Exception:
-                    click.echo(f"Download attempt failed: HTTP {r.status_code}; retrying {attempts}/3")
-                    sleep(attempts)
-                    attempts += 1
-            if r is None or r.status_code != 200:
-                if r:
-                    code = r.status_code
-                else:
-                    code = "-1"
-                raise HTTPError(download_url, code, "", None, None)
-            if not os.path.exists(download_path):
-                os.makedirs(download_path)
-            with open(download_filename, 'wb') as f:
-                for chunk in r.iter_content(chunk_size=8192):
-                    f.write(chunk)
-            click.echo(f"Successfully downloaded to {download_filename}")
-    except Exception as e:
-        click.echo(f"Download attempt failed: {e}")
-        return None
+def downloadSong(destination, filename, entry, dlid, dldesc):
+    click.echo(f"""> Downloading song "{entry['artist']} - {entry['title']}" by {entry['author']}...""")
+    if dlid is None:
+        dl_links = entry['dl_links']
+    else:
+        try:
+            dl_links = [entry['dl_links'][dlid - 1]]
+        except Exception:
+            click.echo(f"Invalid download link ID {dlid}.")
+            return
+
+    if dldesc is not None:
+        new_dl_links = list()
+        for link_entry in dl_links:
+            if dldesc in link_entry['description']:
+                new_dl_links.append(link_entry)
+        dl_links = new_dl_links
+        if not dl_links:
+            click.echo(f'No download link matching description "{dldesc}" found.')
+            return
+
+    for dl_link in dl_links:
+        try:
+            p = requests.get(dl_link['link'])
+            if p.status_code != 200:
+                raise HTTPError(dl_link['link'], p.status_code, "", None, None)
+
+            parsed_html = BeautifulSoup(p.text, 'html.parser')
+            download_url = parsed_html.body.find('div', attrs={'class':'lock-head'}).find('a').get('href')
+        except Exception as e:
+            click.echo(f"Failed parsing or retrieving HTML link: {e}")
+            continue
+
+        download_filename = filename.format(
+            genre=entry['genre'],
+            artist=entry['artist'],
+            album=entry['album'],
+            title=entry['title'],
+            year=entry['year'],
+            author=entry['author'],
+            orig_name=download_url.split('/')[-1],
+        )
+        download_filename = f"{destination}/{download_filename}"
+        download_path = '/'.join(f"{download_filename}".split('/')[0:-1])
+
+        click.echo(f"""Downloading file "{dl_link['description']}" from {download_url}...""")
+        if os.path.exists(download_filename):
+            click.echo(f"File exists at {download_filename}")
+            continue
+
+        attempts = 1
+        p = None
+        try:
+            with requests.get(download_url, stream=True) as r:
+                while attempts <= 3:
+                    try:
+                        r.raise_for_status()
+                        break
+                    except Exception:
+                        click.echo(f"Download attempt failed: HTTP {r.status_code}; retrying {attempts}/3")
+                        sleep(attempts)
+                        attempts += 1
+                if r is None or r.status_code != 200:
+                    if r:
+                        code = r.status_code
+                    else:
+                        code = "-1"
+                    raise HTTPError(download_url, code, "", None, None)
+                if not os.path.exists(download_path):
+                    os.makedirs(download_path)
+                with open(download_filename, 'wb') as f:
+                    for chunk in r.iter_content(chunk_size=8192):
+                        f.write(chunk)
+                click.echo(f"Successfully downloaded to {download_filename}")
+        except Exception as e:
+            click.echo(f"Download attempt failed: {e}")
+            continue
@@ -158,7 +237,11 @@ def downloadSong(destination, filename, entry):
     "-p", "--pages", "_pages", type=int, default=None, envvar='C3DBDL_BUILD_PAGES',
     help="Number of pages to scan (default is all)."
 )
-def build_database(_overwrite, _pages):
+@click.option(
+    "-c", "--concurrency", "_concurrency", type=int, default=10, envvar='C3DBDL_BUILD_CONCURRENCY',
+    help="Number of concurrent song page downloads to perform at once."
+)
+def build_database(_overwrite, _pages, _concurrency):
     """
     Initialize the local JSON database of C3DB songs from the website.
@@ -173,7 +256,7 @@ def build_database(_overwrite, _pages):
         exit(1)
 
     click.echo("Building JSON database; this will take a long time...")
-    songs_database = buildDatabase(_pages)
+    songs_database = buildDatabase(_pages, _concurrency)
     click.echo('')
     click.echo(f"Found {len(songs_database)} songs, dumping to database file '{config['database_filename']}'")
 
     if not os.path.exists(config['download_directory']):
@@ -267,7 +350,17 @@ def database():
     default=None, type=int,
     help='Limit to this many songs (first N matches).'
 )
-def download(_filters, _limit, _file_structure):
+@click.option(
+    "-i", "--download-id", "_id",
+    default=None, type=int,
+    help='Download only "dl_links" entry N (1 is first, etc.), or all if unspecified.'
+)
+@click.option(
+    "-d", "--download-descr", "_desc",
+    default=None,
+    help='Download only "dl_links" entries with this in their description (fuzzy).'
+)
+def download(_filters, _id, _desc, _limit, _file_structure):
     """
     Download song(s) from the C3DB webpage.
@@ -286,14 +379,12 @@ def download(_filters, _limit, _file_structure):
     The default output file structure is:
         "{genre}/{author}/{artist}/{album}/{title} [{year}].{orig_name}"
 
-    \b
     Filters allow granular selection of the song(s) to download. Multiple filters can be
     specified, and a song is selected only if ALL filters match (logical AND). Each filter
-    is in the form:
-        --filter [database_key] [value]
+    is in the form "--filter [database_key] [value]".
 
-    \b
-    The valid "database_key" values are identical to the output file fields above.
+    The valid "database_key" values are identical to the output file fields above, except
+    for "orig_name".
 
     \b
     For example, to download all songs in the genre "Rock":
@@ -303,6 +394,13 @@ def download(_filters, _limit, _file_structure):
     Or to download all songs by the artist "Rush" and the author "MyName":
         --filter artist Rush --filter author MyName
 
+    In addition to filters, each song may have more than one download link, to provide
+    multiple versions of the same song (for example, normal and multitracks, or alternate
+    charts). For each song, the "-i"/"--download-id" and "-d"/"--download-descr" options
+    can help filter these out, or both can be left blank to download all possible files
+    for a given song. Mostly useful when being extremely restrictive with filters, less
+    so when downloading many songs at once.
+
     \b
     The following environment variables can be used for scripting purposes:
       * C3DBDL_DL_FILE_STRUCTURE: equivalent to "--file-structure"
@@ -331,7 +429,7 @@ def download(_filters, _limit, _file_structure):
     click.echo(f"Downloading {len(pending_songs)} song files...")
 
     for song in pending_songs:
-        downloadSong(config['download_directory'], _file_structure, song)
+        downloadSong(config['download_directory'], _file_structure, song, _id, _desc)
 
 @click.group(context_settings=CONTEXT_SETTINGS)
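The "-i"/"--download-id" and "-d"/"--download-descr" options amount to a small link-selection rule: an ID picks exactly one "dl_links" entry (1-based), a description keeps substring matches, and leaving both unset keeps every link. A rough standalone sketch of that rule, where `select_links` and the sample entries are purely illustrative (the real logic lives inline in `downloadSong` and also prints error messages):

```python
def select_links(dl_links, dlid=None, dldesc=None):
    # An ID picks exactly one entry (1 is the first link).
    if dlid is not None:
        if not 1 <= dlid <= len(dl_links):
            return []
        dl_links = [dl_links[dlid - 1]]
    # A description keeps entries containing it as a substring (fuzzy).
    if dldesc is not None:
        dl_links = [l for l in dl_links if dldesc in l["description"]]
    return dl_links

links = [
    {"link": "https://example.com/a", "description": "Normal chart"},
    {"link": "https://example.com/b", "description": "Multitracks"},
]
print(select_links(links, dldesc="Multi"))
```

One difference from the tool itself: an out-of-range ID returns an empty selection here, whereas `downloadSong` prints an error and returns.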