Add c2dbdl script

2023-04-02 12:50:09 -04:00
commit cdc67eb114
3 changed files with 536 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -0,0 +1,128 @@
+# C3DB Download Tool
+
+The C3DB Download Tool allows for easy scraping to a local JSON database and downloading of files from the C3
+(Customs Creators Collective) database, a collection of custom songs for Guitar Hero, Rock Band, and similar clone
+games.
+
+This tool exists because the C3DB is very hard to mass download from: each song must be found in the extensive
+list, selected manually, and a second link clicked through, before a random file name is obtained. This tool
+simplifies the process by first collecting information about all available songs of a particular type, and then is
+able to download songs based on customizable filters (e.g. by genre, artist, author, etc.) and output them in a
+standardized format.
+
+To use the tool, first use the "database" command to build or modify your local JSON database, then use the
+"download" command to download songs.
+
+To avoid overloading or abusing the C3DB website, this tool operates exclusively in sequential mode by design; at
+most one page is scraped (for "database build") or song downloaded (for "download") at once. Additionally, the tool
+design ensures that the JSON database of songs is stored locally, so it only needs to be built once and then is
+reused to perform actual downloads without putting further load on the website.
+
+## Installation
+
+1. Install the Python3 requirements from `requirements.txt`.
+
+1. Copy the script into a virtualenv or somewhere in your `$PATH`, or execute it directly from this folder (see Usage below).
+
+## Usage
+
+Before running a command, use the built-in help via the `-h`/`--help` option to view the available option(s) of
+the command.
+
+The general process of using `c3dbdl` is as follows:
+
+1. Select a download location, and either specify it with the `-d`/`--download-directory` option or via the
+environment variable `C3DBDL_DOWNLOAD_DIRECTORY`.
+
+1. Select a base URL. Use this to limit the tool to the game(s) you want, or use the default to
+fetch all available songs for all games. Specify it either with the `-u`/`--base-url` option or via the
+environment variable `C3DBDL_BASE_URL`.
+
+1. Initialize your C3DB JSON database with `c3dbdl [options] database build`. This will take a fair amount
+of time to complete as all pages of the chosen base URL are scanned. Note that if you cancel this process, no
+data will be saved, so let it complete!
+
+1. Download any song(s) you want with `c3dbdl [options] download [options]`.
+
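As a rough sketch of how the page scan in the build step can walk the site one page at a time: the script reads the page count from the last-page pagination link and then requests `?page=N` for each page in order. The helper names `last_page_from_href` and `page_urls` below are hypothetical and not part of `c3dbdl`, and the href shape (`?page=3`) is assumed from the script's own parsing.

```python
from urllib.parse import parse_qs, urlsplit


def last_page_from_href(href):
    """Extract the page count from a pagination href such as '?page=42'."""
    query = parse_qs(urlsplit(href).query)
    return int(query["page"][0])


def page_urls(base_url, pages):
    """Yield each page URL in order, so pages can be fetched one at a time."""
    for i in range(1, pages + 1):
        yield f"{base_url}?page={i}"


# Illustration only: no request is made here, the URLs are just printed.
pages = last_page_from_href("?page=3")
for url in page_urls("https://db.c3universe.com/songs/all", pages):
    print(url)
```

Parsing the query string instead of stripping the `?page=` prefix is a design choice of this sketch; it should keep working if the link ever carries additional parameters.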
+## Filtering
+
+Filtering out the songs in the database is a key part of this tool. You might want to be able to grab only select
+genres, artists, authors, etc. to make your custom song packs.
+
+`c3dbdl` is able to filter by several key categories:
+
+* `genre`: The genre of the song.
+* `artist`: The artist of the song.
+* `album`: The album of the song.
+* `title`: The title of the song.
+* `year`: The year of the album/song.
+* `author`: The author of the file on C3DB.
+
+Note that we *cannot* filter - mostly for parsing difficulty reasons - by instrument type or difficulty, by song
+length, or by any other information not mentioned above.
+
+Filtering is always done during the download stage; the JSON database will always contain all possible entries.
+
+To use filters, append one or more `--filter` options to your `c3dbdl download` command. A filter option begins
+with the literal `--filter`, followed by the category (e.g. `genre` or `artist`), then finally the text to filter
+on, for instance `Rock` or `Santana` or `2012`. The text must be quoted if it contains whitespace.
+
+If more than one filter is specified, they are treated as a logical AND, i.e. all the listed filters must apply to
+a given song for it to be downloaded in that run.
+
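The logical AND described above can be sketched in a few lines of Python. This is a simplified illustration, not the script's exact code: `matches_all` is a hypothetical helper, and every entry except the Rush one is made-up placeholder data. As in the script, values are compared for exact, case-sensitive equality.

```python
def matches_all(song, filters):
    """Return True when every (key, value) filter equals the song's field."""
    # .get() is used here so an unknown key simply fails to match.
    return all(song.get(key) == value for key, value in filters)


# Placeholder database entries for illustration.
songs = [
    {"artist": "Rush", "album": "Vapor Trails [Remixed]", "author": "ejthedj"},
    {"artist": "Rush", "album": "Example Album", "author": "example_author"},
    {"artist": "Example Artist", "album": "Example Album", "author": "ejthedj"},
]

# Equivalent to: --filter artist Rush --filter author ejthedj
filters = [("artist", "Rush"), ("author", "ejthedj")]

selected = [song for song in songs if matches_all(song, filters)]
print(selected)
```

With an empty filter list, `all()` over no conditions is true, so every song is selected; this matches the tool's behaviour when no `--filter` is given.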
+Filters allow powerfully specific download selections to be run. For example, let's look for all songs by Rush
+from the album Vapor Trails (the remixed version) authored by ejthedj:
+
+```
+c3dbdl download --filter artist Rush --filter album "Vapor Trails [Remixed]" --filter author ejthedj
+```
+
+This should find, as of 2023-04-02, exactly one song, "Sweet Miracle":
+
+```
+Found 28942 songs from JSON database file 'Downloads/c3db.json'
+Downloading 1 song files...
+Downloading song "Rush - Sweet Miracle" by ejthedj...
+Downloading from https://dl.c3universe.com/s/ejthedj/sweetMiracle...
+```
+
+Feel free to experiment.
+
+## Output Format
+
+When downloading files, it may be advantageous to customize the output directory and filename structure to better
+match what you plan to do with the files. For instance, for pure organization you might want nicely laid out
+files with clear directory structures and names, while for Onyx packaging you might want everything in a flat
+directory.
+
+`c3dbdl` provides complete flexibility in the output file format. When downloading, use the `--file-structure`
+option to set the file structure. This value is an interpolated string containing one or more field variables,
+which are filled in for each song at download time. The available fields are:
+
+* `genre`: The genre of the song.
+* `artist`: The artist of the song.
+* `album`: The album of the song.
+* `title`: The title of the song.
+* `year`: The year of the album/song.
+* `author`: The author of the file on C3DB.
+* `orig_name`: The original filename that would be downloaded by e.g. a browser.
+
+The default structure leverages all of these options to create an archive-ready structure as follows:
+
+```
+{genre}/{artist}/{album}/{title} [{year}] ({author}).{orig_name}
+```
+
+As an example:
+
+```
+Prog/Rush/Vapor Trails [Remixed]/Sweet Miracle [2002] (ejthedj).sweetMiracle
+```
+
+Note that any parent director(ies) will be automatically created down the whole tree until the final filename.
+
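As a sketch of how such a structure string turns into a path: Python's `str.format` substitutes each field, and the parent directory is whatever precedes the final `/`. The script's format call passes the original filename as `orig_name`; the `Downloads` prefix and the variable names below are illustrative assumptions.

```python
import os

# Default structure as used by the script's format call (note `orig_name`).
structure = "{genre}/{artist}/{album}/{title} [{year}] ({author}).{orig_name}"

# Field values taken from the "Sweet Miracle" example above.
entry = {
    "genre": "Prog",
    "artist": "Rush",
    "album": "Vapor Trails [Remixed]",
    "title": "Sweet Miracle",
    "year": "2002",
    "author": "ejthedj",
}

relative_path = structure.format(orig_name="sweetMiracle", **entry)
full_path = f"Downloads/{relative_path}"
parent_dir = os.path.dirname(full_path)

print(relative_path)
print(parent_dir)
```

A flat layout, for instance for Onyx packaging, would simply omit the slashes, e.g. `"{artist} - {title} ({author}).{orig_name}"`.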
+## Help
+
+This is a quick and dirty tool I wrote to quickly grab collections of songs. I provide no guarantee of success
+when using this tool. If you have issues, please open an issue on this repository and provide *full details*
+of your problem.
--- a/404
+++ b/404
@ -0,0 +1,404 @@
+#!/usr/bin/env python3
+
+import click
+import requests
+import re
+import json
+import os
+from time import sleep
+from difflib import unified_diff
+from colorama import Fore
+from bs4 import BeautifulSoup
+from urllib.error import HTTPError
+
+CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help'], max_content_width=120)
+
+
+def buildDatabase(pages=None):
+    found_songs = []
+
+    if pages is None:
+        r = requests.get(f"{config['base_songs_url']}")
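+        if r.status_code != 200:
+            # Return the (empty) list rather than None so callers can still call len() on the result
+            click.echo(f"Failed to fetch '{config['base_songs_url']}': HTTP {r.status_code}")
+            return found_songs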
+
+        root_page_html = BeautifulSoup(r.text, 'html.parser')
+        pages = int(root_page_html.body.find('a', attrs={'class':'paginationLastPage'}).get('href').replace('?page=', ''))
+
+    click.echo(f"Collecting data from {pages} pages")
+
+    # Get a list of song URIs
+    for i in range(1, pages + 1):
+        attempts = 1
+        p = None
+        while attempts <= 5:
+            try:
+                click.echo(f"Parsing page {i} (attempt #{attempts})...")
+                p = requests.get(f"{config['base_songs_url']}?page={i}")
+                break
+            except Exception:
+                sleep(attempts)
+                attempts += 1
+        if p is None or p.status_code != 200:
+            break
+
+        parsed_html = BeautifulSoup(p.text, 'html.parser')
+
+        table_html = parsed_html.body.find('div', attrs={'class':'portlet-body'}).find('tbody')
+
+        for entry in table_html.find_all('tr', attrs={'class':'odd'}):
+            if len(entry) < 1:
+                break
+
+            song_entry = dict()
+
+            for idx, td in enumerate(entry.find_all('td')):
+                if idx == 1:
+                    # Download link
+                    song_entry["dl_link"] = td.find('a', attrs={'target':'_blank'}).get('href')
+                elif idx == 2:
+                    # Artist
+                    song_entry["artist"] = td.find('a').get_text().strip().replace('/', '+')
+                elif idx == 3:
+                    # Song
+                    song_entry["title"] = td.find('div', attrs={'class':'c3ttitlemargin'}).get_text().strip().replace('/', '+')
+                    song_entry["album"] = td.find('div', attrs={'class':'c3tartist'}).get_text().strip().replace('/', '+')
+                elif idx == 4:
+                    # Genre
+                    song_entry["genre"] = td.find('a').get_text().strip()
+                elif idx == 5:
+                    # Year
+                    song_entry["year"] = td.find('a').get_text().strip()
+                elif idx == 6:
+                    # Length
+                    song_entry["length"] = td.find('a').get_text().strip()
+                elif idx == 8:
+                    # Author (of chart)
+                    song_entry["author"] = td.find('a').get_text().strip().replace('/', '+')
+
+            if song_entry and song_entry['title']:
+                click.echo(f"Found song entry for {song_entry['artist']} - {song_entry['title']} by {song_entry['author']}")
+                found_songs.append(song_entry)
+
+    return found_songs
+
+
+def downloadSong(destination, filename, entry):
+    click.echo(f"""Downloading song "{entry['artist']} - {entry['title']}" by {entry['author']}...""")
+
+    try:
+        p = requests.get(entry['dl_link'])
+        if p.status_code != 200:
+            raise HTTPError(entry['dl_link'], p.status_code, "", None, None)
+
+        parsed_html = BeautifulSoup(p.text, 'html.parser')
+        download_url = parsed_html.body.find('div', attrs={'class':'lock-head'}).find('a').get('href')
+    except Exception as e:
+        click.echo(f"Failed parsing or retrieving HTML link: {e}")
+        return None
+
+    download_filename = filename.format(
+        genre=entry['genre'],
+        artist=entry['artist'],
+        album=entry['album'],
+        title=entry['title'],
+        year=entry['year'],
+        author=entry['author'],
+        orig_name=download_url.split('/')[-1],
+    )
+    download_filename = f"{destination}/{download_filename}"
+    download_path = os.path.dirname(download_filename)
+
+    # Create any missing parent directories; exist_ok avoids a separate existence check
+    os.makedirs(download_path, exist_ok=True)
+
+    if os.path.exists(download_filename):
+        click.echo(f"File exists at {download_filename}")
+        return None
+
+    click.echo(f"""Downloading from {download_url}...""")
+    attempts = 1
+    while attempts <= 5:
+        try:
+            # Issue a fresh request on each attempt; re-checking an already failed response cannot succeed
+            with requests.get(download_url, stream=True) as r:
+                r.raise_for_status()
+                with open(download_filename, 'wb') as f:
+                    for chunk in r.iter_content(chunk_size=8192):
+                        f.write(chunk)
+            click.echo(f"Successfully downloaded to {download_filename}")
+            return None
+        except Exception as e:
+            click.echo(f"Download attempt {attempts}/5 failed: {e}")
+            # Remove any partial file so a later run does not mistake it for a finished download
+            if os.path.exists(download_filename):
+                os.remove(download_filename)
+            sleep(attempts)
+            attempts += 1
+
+    click.echo(f"Giving up on {download_url} after 5 failed attempts")
+    return None
+
+
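The retry pattern used for page scrapes and downloads (up to five attempts, waiting one second longer after each failure) can be sketched in isolation. The `retry` helper and the `flaky` stand-in below are hypothetical and not part of the script; the injectable `sleep` parameter is an assumption added so the sketch runs without real delays.

```python
import time


def retry(operation, max_attempts=5, sleep=time.sleep):
    """Call operation() until it succeeds, waiting 1s, 2s, ... between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the last error propagate
            print(f"Attempt {attempt}/{max_attempts} failed: {exc}")
            sleep(attempt)


# Stand-in for a network call that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# The sleep is stubbed out so the demonstration finishes immediately.
result = retry(flaky, sleep=lambda seconds: None)
print(result)
```

The important detail is that `operation()` is invoked again on every attempt; checking the status of a single earlier response repeatedly would never change the outcome.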
+
+@click.command(name='build', short_help='Build the local database.')
+@click.option(
+    "-o", "--overwrite", '_overwrite', is_flag=True, default=False, envvar='C3DBDL_BUILD_OVERWRITE',
+    help="Overwrite existing database file."
+)
+@click.option(
+    "-p", "--pages", "_pages", type=int, default=None, envvar='C3DBDL_BUILD_PAGES',
+    help="Number of pages to scan (default is all)."
+)
+def build_database(_overwrite, _pages):
+    """
+    Initialize the local JSON database of C3DB songs from the website.
+
+    \b
+    The following environment variables can be used for scripting purposes:
+      * C3DBDL_BUILD_OVERWRITE: equivalent to "--overwrite"
+      * C3DBDL_BUILD_PAGES: equivalent to "--pages"
+    """
+
+    if os.path.exists(config['database_filename']) and not _overwrite:
+        click.echo(f"Database already exists at '{config['database_filename']}'; use '--overwrite' to rebuild.")
+        exit(1)
+
+    click.echo("Building JSON database; this will take a long time...")
+    songs_database = buildDatabase(_pages)
+    click.echo('')
+    click.echo(f"Found {len(songs_database)} songs, dumping to database file '{config['database_filename']}'")
+    if not os.path.exists(config['download_directory']):
+        click.echo(f"Creating download directory '{config['download_directory']}'")
+        os.makedirs(config['download_directory'])
+    with open(config['database_filename'], "w") as fh:
+        json.dump(songs_database, fh, indent=2)
+        fh.write('\n')
+
+
+@click.command(name='edit', short_help='Edit the local database in EDITOR.')
+def edit_database():
+    """
+    Edit the local JSON database of C3DB songs with your $EDITOR.
+    """
+
+    if not os.path.exists(config['database_filename']):
+        click.echo(f"WARNING: Database filename '{config['database_filename']}' does not exist!")
+        click.echo("Ensure you build a database first with the 'database build' command.")
+        exit(1)
+
+    with open(config['database_filename'], "r") as fh:
+        songs_database = fh.read()
+
+    new_songs_database = click.edit(text=songs_database, require_save=True, extension='.json')
+    while True:
+        if new_songs_database is None:
+            click.echo("Aborting with no modifications")
+            exit(0)
+       
+        click.echo('')
+        click.echo("Pending modifications:")
+        click.echo('')
+        diff = list(unified_diff(
+                                 songs_database.split('\n'),
+                                 new_songs_database.split('\n'),
+                                 fromfile='current',
+                                 tofile='modified',
+                                 fromfiledate='',
+                                 tofiledate='',
+                                 n=3,
+                                 lineterm=''))
+        for line in diff:
+            if re.match(r'^\+', line) is not None:
+                click.echo(Fore.GREEN + line + Fore.RESET)
+            elif re.match(r'^\-', line) is not None:
+                click.echo(Fore.RED + line + Fore.RESET)
+            elif re.match(r'^@', line) is not None:
+                click.echo(Fore.BLUE + line + Fore.RESET)
+            else:
+                click.echo(line)
+        click.echo('')
+
+        try:
+            json.loads(new_songs_database)
+            break
+        except Exception:
+            click.echo('ERROR: Invalid JSON syntax.')
+            click.confirm('Continue editing?', abort=True)
+            new_songs_database = click.edit(text=new_songs_database, require_save=True, extension='.json')
+
+    click.confirm('Write modifications to songs database?', abort=True)
+
+    with open(config['database_filename'], "w") as fh:
+        fh.write(new_songs_database)
+
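For reference, a small sketch of the line types `difflib.unified_diff` emits, which is what the colouring in `edit_database` keys off: file headers start with `---`/`+++`, hunk headers with `@@`, and changed lines with `+` or `-`. The `classify` helper and the sample JSON lines are hypothetical, for illustration only.

```python
from difflib import unified_diff


def classify(line):
    """Label a unified-diff line by its leading characters."""
    if line.startswith(("+++", "---")):
        return "header"
    if line.startswith("@@"):
        return "hunk"
    if line.startswith("+"):
        return "added"
    if line.startswith("-"):
        return "removed"
    return "context"


# Made-up before/after snippets of a database entry.
old = ['{', '  "year": "2002"', '}']
new = ['{', '  "year": "2013"', '}']

for line in unified_diff(old, new, fromfile="current", tofile="modified", lineterm=""):
    print(f"{classify(line):8} {line}")
```

The file headers are checked before the single-character prefixes here so that `+++` and `---` are not reported as added or removed lines.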
+
+@click.group(name="database", short_help='Manage the local database.')
+def database():
+    """
+    Manage the local JSON database of C3DB songs.
+    """
+
+    pass
+
+
+@click.command(name="download", short_help='Download files from C3DB.')
+@click.option(
+    '-s', '--file-structure', '_file_structure', envvar='C3DBDL_DL_FILE_STRUCTURE',
+    default="{genre}/{artist}/{album}/{title} [{year}] ({author}).{orig_name}",
+    help='Specify the output file/directory structure.'
+)
+@click.option(
+    '-f', '--filter', '_filters', envvar='C3DBDL_DL_FILTERS',
+    default=[], multiple=True,
+    nargs=2,
+    help='Add a filter option.'
+)
+@click.option(
+    '-l', '--limit', '_limit', envvar='C3DBDL_DL_LIMIT',
+    default=None, type=int,
+    help='Limit to this many songs (first N matches).'
+)
+def download(_filters, _limit, _file_structure):
+    """
+    Download song(s) from the C3DB webpage.
+
+    \b
+    The output file structure can be specified as a path format with any of the following
+    fields included, surrounded by curly braces:
+      * genre: The genre of the song.
+      * artist: The artist of the song.
+      * album: The album of the song.
+      * title: The title of the song.
+      * year: The year of the album/song.
+      * author: The author of the file on C3DB.
+      * orig_name: The original filename from the website.
+
+    \b
+    The default output file structure is:
+        "{genre}/{artist}/{album}/{title} [{year}] ({author}).{orig_name}"
+
+    \b
+    Filters allow granular selection of the song(s) to download. Multiple filters can be
+    specified, and a song is selected only if ALL filters match (logical AND). Each filter
+    is in the form:
+      --filter [database_key] [value]
+
+    \b
+    The valid "database_key" values are identical to the output file fields above, except
+    "orig_name", which is only known at download time and is not stored in the database.
+
+    \b
+    For example, to download all songs in the genre "Rock":
+      --filter genre Rock
+
+    \b
+    Or to download all songs by the artist "Rush" and the author "MyName":
+      --filter artist Rush --filter author MyName
+
+    \b
+    The following environment variables can be used for scripting purposes:
+      * C3DBDL_DL_FILE_STRUCTURE: equivalent to "--file-structure"
+      * C3DBDL_DL_FILTERS: equivalent to "--filter"; limited to one instance
+      * C3DBDL_DL_LIMIT: equivalent to "--limit"
+    """
+
+    with open(config['database_filename'], "r") as fh:
+        all_songs = json.load(fh)
+    click.echo(f"Found {len(all_songs)} songs from JSON database file '{config['database_filename']}'")
+
+    pending_songs = list()
+
+    for song in all_songs:
+        if len(_filters) < 1:
+            add_to_pending = True
+        else:
+            add_to_pending = all(song[_filter[0]] == _filter[1] for _filter in _filters)
+
+        if add_to_pending:
+            pending_songs.append(song)
+
+    if _limit is not None:
+        pending_songs = pending_songs[0:_limit]
+
+    click.echo(f"Downloading {len(pending_songs)} song files...")
+
+    for song in pending_songs:
+        downloadSong(config['download_directory'], _file_structure, song)
+
+
+@click.group(context_settings=CONTEXT_SETTINGS)
+@click.option(
+    '-u', '--base-url', '_base_url', envvar='C3DBDL_BASE_URL',
+    default='https://db.c3universe.com/songs/all', show_default=True,
+    help='Base URL of the online C3DB songs page'
+)
+@click.option(
+    '-d', '--download-directory', '_download_directory', envvar='C3DBDL_DOWNLOAD_DIRECTORY',
+    default='~/Downloads', show_default=True,
+    help='Download directory for JSON database and songs'
+)
+@click.option(
+    '-j', '--json-database', '_json_database', envvar='C3DBDL_JSON_DATABASE',
+    default='c3db.json', show_default=True,
+    help='JSON database filename within download directory'
+)
+def cli(_base_url, _download_directory, _json_database):
+    """
+    C3DB Download Tool
+
+    The C3DB Download Tool allows for easy scraping to a local JSON database and downloading
+    of files from the C3 (Customs Creators Collective) database, a collection of custom songs
+    for Guitar Hero, Rock Band, and similar clone games.
+
+    This tool exists because the C3DB is very hard to mass download from: each song must
+    be found in the extensive list, selected manually, and a second link clicked through,
+    before a random file name is obtained. This tool simplifies the process by first collecting
+    information about all available songs of a particular type, and then is able to download
+    songs based on customizable filters (e.g. by genre, artist, author, etc.) and output them
+    in a standardized format.
+    
+    To use the tool, first use the "database" command to build or modify your local JSON
+    database, then use the "download" command to download songs.
+
+    To avoid overloading or abusing the C3DB website, this tool operates exclusively in
+    sequential mode by design; at most one page is scraped (for "database build") or song
+    downloaded (for "download") at once. Additionally, the tool design ensures that the JSON
+    database of songs is stored locally, so it only needs to be built once and then is reused
+    to perform actual downloads without putting further load on the website.
+
+    \b
+    The following environment variables can be used for scripting purposes:
+      * C3DBDL_BASE_URL: equivalent to "--base-url"
+      * C3DBDL_DOWNLOAD_DIRECTORY: equivalent to "--download-directory"
+      * C3DBDL_JSON_DATABASE: equivalent to "--json-database"
+
+    """
+    
+    global config
+
+    # Expand any ~ in the download directory pathname
+    _download_directory = os.path.expanduser(_download_directory)
+
+    # Populate the configuration store
+    config['base_songs_url'] = _base_url
+    config['download_directory'] = _download_directory
+    config['database_filename'] = f"{_download_directory}/{_json_database}"
+
+
+config = dict()
+
+database.add_command(build_database)
+database.add_command(edit_database)
+
+cli.add_command(database)
+cli.add_command(download)
+
+def main():
+    return cli(obj={})
+
+if __name__ == '__main__':
+    main()
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1,4 @@
+Click
+requests
+colorama
+beautifulsoup4