schrodinger.application.bb_database.bb_task_utils module

Performs build, describe, qsplit and qrun tasks for bb_database_driver.py.

The following functions may be called in lieu of running a bb_database job when the database is locally accessible:

Function            bb_database Arguments
-----------------   ---------------------------------------------------------
build_database      build [options] <bbkeys> <dbname>.bbdb
rebuild_chunk       build -rebuild <chunkdb> [options] <bbkeys> <dbname>.bbdb
describe_database   describe [options] <dbname>.bbdb
split_query         qsplit [options] <bbkeys> <dbname>.bbdb
run_query           qrun [options] <query>.bbq <dbname>.bbdb
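
As a hedged illustration of calling these functions directly, the sketch below builds a new database and then describes it, using only the signatures documented on this page. The wrapper name build_and_describe is hypothetical, and the import is deferred so that the snippet can be loaded without a Schrodinger installation; it will only work at call time inside a Schrodinger Python environment.

```python
import logging
from typing import Optional


def build_and_describe(dbpath: str, key_files: list[str],
                       logger: Optional[logging.Logger] = None) -> str:
    """Hypothetical wrapper: build a new database, then return its description."""
    # Deferred import: needs a Schrodinger Python environment at call time.
    from schrodinger.application.bb_database import bb_task_utils

    # newdb=True requests a new database; all other arguments keep their defaults.
    bb_task_utils.build_database(dbpath, key_files, newdb=True, logger=logger)
    # verbose=True asks for information about each chunk as well.
    return bb_task_utils.describe_database(dbpath, verbose=True)
```

Per the parameter descriptions on this page, dbpath should be an absolute path ending in .bbdb and key_files should be CSV files (.csv, .csv.gz, .csvgz).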

Copyright Schrodinger LLC, All Rights Reserved.

schrodinger.application.bb_database.bb_task_utils.add_chunk_row(dbpath: str, chunk_index: int, collector: phase.BBCollector) None

Adds a row to <dbpath>/chunks.csv.

schrodinger.application.bb_database.bb_task_utils.build_database(dbpath: str, key_files: list[str], newdb: bool, chunk_size: Optional[int] = 10000000, max_chunks: Optional[int] = None, commit_size: Optional[int] = 1000000, key_column: Optional[str] = 'InChIKey', key_substr: Optional[str] = ':', logger: Optional[logging.Logger] = None) None

Builds a new database or adds chunks to an existing database. May be called in lieu of running a bb_database build job.

Parameters
  • dbpath – Absolute path to database (.bbdb)

  • key_files – CSV files (.csv, .csv.gz, .csvgz) with building block keys

  • newdb – Whether to create a new database

  • chunk_size – Number of building blocks per chunk database. Ignored if newdb is False.

  • max_chunks – Maximum number of database chunks to add. The default is to add chunks until no more building blocks remain.

  • commit_size – Number of rows added to a chunk database per commit

  • key_column – Name of the column that holds building block keys

  • key_substr – <min>:<max> slice of building block key field. If the key field looks like ‘InChIKey=VXDCVOCMBGPXPL-UHFFFAOYSA-N’ key_substr should be ‘9:’

  • logger – Logger for informative messages
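
The key_substr convention can be illustrated with a small standalone sketch. It assumes the <min>:<max> specification behaves like a Python slice, which the '9:' example above suggests but does not confirm; apply_key_substr is a hypothetical helper, not part of this module.

```python
def apply_key_substr(field: str, key_substr: str = ":") -> str:
    """Apply a '<min>:<max>' slice specification to a key field (illustrative)."""
    start_text, sep, stop_text = key_substr.partition(":")
    if not sep:
        raise ValueError(f"expected '<min>:<max>', got {key_substr!r}")
    # An empty bound means "from the start" or "to the end", as in a Python slice.
    start = int(start_text) if start_text else None
    stop = int(stop_text) if stop_text else None
    return field[start:stop]


raw = "InChIKey=VXDCVOCMBGPXPL-UHFFFAOYSA-N"
print(apply_key_substr(raw, "9:"))  # drops the 9-character 'InChIKey=' prefix
print(apply_key_substr(raw))        # default ':' keeps the whole field
```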

schrodinger.application.bb_database.bb_task_utils.create_new_database(dbpath: str, chunk_size: Optional[int] = 10000000, logger: Optional[logging.Logger] = None) None

Creates a new, empty database.

Parameters
  • dbpath – Absolute path to database (.bbdb)

  • chunk_size – Number of building blocks per chunk database

  • logger – Logger for informative messages

schrodinger.application.bb_database.bb_task_utils.collect_bbkeys(key_files: list[str], chunk_size: Optional[int] = 10000000, lower_bound: Optional[str] = '', key_column: Optional[str] = 'InChIKey', key_substr: Optional[str] = ':', logger: Optional[logging.Logger] = None) phase.BBCollector

Makes a pass through the key files and returns the keys, sorted and capped at chunk_size, in a BBCollector object.

Parameters
  • key_files – CSV files (.csv, .csv.gz, .csvgz) with building block keys

  • chunk_size – Cap on the number of sorted building blocks

  • lower_bound – Only keys > lower_bound are added

  • key_column – Name of the column that holds building block keys

  • key_substr – <min>:<max> slice of building block key field

  • logger – Logger for informative messages
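
One way to picture "sorted, capped keys above a lower bound" is the sketch below. It is not the BBCollector implementation, whose internals are not described here; smallest_keys_above is a hypothetical helper that works on an in-memory iterable, and it makes no attempt to remove duplicate keys.

```python
import heapq
from typing import Iterable


def smallest_keys_above(keys: Iterable[str], chunk_size: int,
                        lower_bound: str = "") -> list[str]:
    """Return up to chunk_size sorted keys that are strictly > lower_bound."""
    candidates = (key for key in keys if key > lower_bound)
    # nsmallest tracks at most chunk_size items and returns them in sorted order.
    return heapq.nsmallest(chunk_size, candidates)


print(smallest_keys_above(["D", "A", "C", "B", "E"], chunk_size=2, lower_bound="A"))
```

Calling such a helper repeatedly, each time with lower_bound set to the last key of the previous result, would walk through the keys one chunk at a time.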

schrodinger.application.bb_database.bb_task_utils.describe_database(dbpath: str, verbose: Optional[bool] = False) str

Returns a string that describes the database contents. May be called in lieu of running a bb_database describe job.

Parameters
  • dbpath – Absolute path to database (.bbdb)

  • verbose – Whether to include information about each chunk

schrodinger.application.bb_database.bb_task_utils.get_chunk_file_name(dbpath: str, collector: phase.BBCollector) str

Returns the name of the chunk database file to which the building block keys in the provided BBCollector should be written.

schrodinger.application.bb_database.bb_task_utils.get_chunk_info(dbpath: str) tuple[int, str, int]

Given the path to an existing database, this function returns a tuple of the chunk size, the lower bound on new building block keys to add, and the 1-based index of the next chunk to add.

schrodinger.application.bb_database.bb_task_utils.get_chunk_row(dbpath: str, low_key: str, high_key: str) list[str]

Returns a row from chunks.csv based on the low and high key values.

schrodinger.application.bb_database.bb_task_utils.get_chunk_rows(dbpath: str, want_header_row: Optional[bool] = False) list[list[str]]

Returns the rows in chunks.csv.

schrodinger.application.bb_database.bb_task_utils.get_key_limits(chunkdb: str) list[str, str]

Determines the low and high building block key values from the name of a chunk database file. The basename of chunkdb should be of the form <lowkey>_<highkey>.chkdb, where <lowkey> and <highkey> are the first and last building block keys in the chunk database. It is assumed that building block keys do not contain underscores.
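
The naming convention described above can be parsed as in the following sketch. key_limits_from_name is a hypothetical stand-in, and its error handling is an assumption rather than the module's documented behavior.

```python
import os


def key_limits_from_name(chunkdb: str) -> list[str]:
    """Split '<lowkey>_<highkey>.chkdb' into [lowkey, highkey] (illustrative)."""
    stem, ext = os.path.splitext(os.path.basename(chunkdb))
    if ext != ".chkdb":
        raise ValueError(f"not a chunk database file: {chunkdb!r}")
    parts = stem.split("_")
    # Keys are assumed to contain no underscores, so exactly two parts remain.
    if len(parts) != 2:
        raise ValueError(f"expected <lowkey>_<highkey>.chkdb, got {chunkdb!r}")
    return parts


print(key_limits_from_name("/data/blocks.bbdb/AAAA-B_ZZZZ-Y.chkdb"))
```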

schrodinger.application.bb_database.bb_task_utils.get_settings(dbpath: str) dict[str, str]

Reads settings.json file and returns the settings as a dict.

schrodinger.application.bb_database.bb_task_utils.log_msg(msg: str, logger: logging.Logger) None

Writes a message to a logger if it exists.

schrodinger.application.bb_database.bb_task_utils.read_query_catalog(query_catalog: str) dict[str, list[str]]

Reads a query catalog file created by split_query() and returns a dictionary that maps each host name to a list of chunkdb files to be queried.

schrodinger.application.bb_database.bb_task_utils.read_split_query(query_file: str) dict[str, list[str]]

Reads a .bbq query file created by split_query() and returns a dictionary that maps chunk database name to [<key_column>, <key1>, <key2>, etc.].

schrodinger.application.bb_database.bb_task_utils.rebuild_chunk(dbpath: str, chunkdb: str, key_files: list[str], commit_size: Optional[int] = 1000000, key_column: Optional[str] = 'InChIKey', key_substr: Optional[str] = ':', logger: Optional[logging.Logger] = None) None

Rebuilds a specific chunk database that’s damaged or incomplete. May be called in lieu of running a bb_database build -rebuild job.

Parameters
  • dbpath – Absolute path to database (.bbdb)

  • chunkdb – Chunk database file (.chkdb) to rebuild. Only the base name is used.

  • key_files – CSV files (.csv, .csv.gz, .csvgz) with building block keys

  • commit_size – Number of rows added to chunk database per commit

  • key_column – Name of the column that holds building block keys

  • key_substr – <min>:<max> slice of building block key field.

  • logger – Logger for informative messages

schrodinger.application.bb_database.bb_task_utils.run_query(dbpath: str, query_file: str, matches_file: str, select_size: Optional[int] = 1000000, logger: Optional[logging.Logger] = None) None

Runs a query created by split_query(). May be called in lieu of running a bb_database qrun job.

Parameters
  • dbpath – Absolute path to database (.bbdb)

  • query_file – Query file (.bbq) created by split_query()

  • matches_file – Output CSV file (.csv, .csv.gz, .csvgz) for matching keys

  • select_size – Maximum number of building block keys per database SELECT statement

  • logger – Logger for informative messages
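
The role of select_size can be sketched as simple batching: keys are processed in groups of at most select_size, one group per SELECT statement. The helper key_batches below is hypothetical and only shows the grouping; how run_query actually issues its statements is not described here.

```python
from typing import Iterator, Sequence


def key_batches(keys: Sequence[str],
                select_size: int = 1000000) -> Iterator[Sequence[str]]:
    """Yield consecutive groups of at most select_size keys (illustrative)."""
    if select_size < 1:
        raise ValueError("select_size must be positive")
    for start in range(0, len(keys), select_size):
        yield keys[start:start + select_size]


for batch in key_batches(["a", "b", "c", "d", "e"], select_size=2):
    print(batch)
```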

schrodinger.application.bb_database.bb_task_utils.split_query(dbpath: str, key_file: str, prefix: str, key_column: Optional[str] = 'InChIKey', key_substr: Optional[str] = ':', logger: Optional[logging.Logger] = None) None

Splits a query according to the database chunks it covers. May be called in lieu of running a bb_database qsplit job.

Parameters
  • dbpath – Absolute path to database (.bbdb)

  • key_file – CSV file (.csv, .csv.gz, .csvgz) with building block keys

  • prefix – Prefix for query files to create.

  • key_column – Name of the column that holds building block keys

  • key_substr – <min>:<max> slice of building block key field.

  • logger – Logger for informative messages
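
Conceptually, splitting a query by chunk means assigning each query key to the chunk whose key range contains it. The sketch below shows one way to do that with a binary search over the chunks' high keys. It assumes the chunk ranges are available as sorted, non-overlapping (low_key, high_key) pairs; group_keys_by_chunk is a hypothetical helper, not the module's algorithm or file format.

```python
import bisect
from collections import defaultdict


def group_keys_by_chunk(keys, chunk_ranges):
    """Map each (low_key, high_key) range to the sorted query keys it contains."""
    high_keys = [high for _, high in chunk_ranges]
    grouped = defaultdict(list)
    for key in sorted(keys):
        # Index of the first chunk whose high key is >= key.
        i = bisect.bisect_left(high_keys, key)
        # Keep the key only if it also lies at or above that chunk's low key.
        if i < len(chunk_ranges) and chunk_ranges[i][0] <= key:
            grouped[chunk_ranges[i]].append(key)
    return dict(grouped)


ranges = [("A", "F"), ("G", "M")]
print(group_keys_by_chunk(["B", "H", "Z", "F", "G"], ranges))
```

Keys that fall outside every range, such as "Z" here, cannot match anything in the database and are dropped in this sketch.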

schrodinger.application.bb_database.bb_task_utils.unzip_query(prefix: str, dest_dir: Optional[str] = None) dict[str, list[str]]

Unzips an archive <prefix>_queries.zip created by zip_query() to the specified directory, which is CWD by default. Returns a dictionary that maps each extracted query file [<dest_dir>/]<prefix>_<host>.bbq to the list [<host>, <dbpath>], where <host> is the name of the host on which the query should be run, and <dbpath> is the location of the building block database on that host.

schrodinger.application.bb_database.bb_task_utils.validate_dbpath(dbpath: str, expected_to_exist: Optional[bool] = True) None

Raises a RuntimeError if dbpath has the wrong extension or is not an absolute path. Raises a FileExistsError if expected_to_exist is False and dbpath exists; raises a FileNotFoundError if expected_to_exist is True and dbpath doesn’t exist.
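
The checks described above can be sketched as follows. check_dbpath is a hypothetical reimplementation for illustration; the order of the checks and the error messages are assumptions.

```python
import os


def check_dbpath(dbpath: str, expected_to_exist: bool = True) -> None:
    """Illustrative version of the documented dbpath checks."""
    if not dbpath.endswith(".bbdb"):
        raise RuntimeError(f"database path must end in .bbdb: {dbpath}")
    if not os.path.isabs(dbpath):
        raise RuntimeError(f"database path must be absolute: {dbpath}")
    exists = os.path.exists(dbpath)
    if expected_to_exist and not exists:
        raise FileNotFoundError(dbpath)
    if not expected_to_exist and exists:
        raise FileExistsError(dbpath)
```

Passing expected_to_exist=False suits the case of creating a new database, where an existing path is the error condition.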

schrodinger.application.bb_database.bb_task_utils.validate_global_database(dbpath: str) None

Raises a RuntimeError if the database does not have global scope, if the rows in chunks.csv are not sequentially numbered starting at 1, or if any row has the wrong number of fields.

schrodinger.application.bb_database.bb_task_utils.validate_key_files(key_files: list[str]) None

Raises a RuntimeError if any of the provided key files does not have a recognized CSV extension. Raises a FileNotFoundError if a file is missing.

schrodinger.application.bb_database.bb_task_utils.write_readme_file(dbpath: str) None

Creates a README.txt file with important information about building block databases.

schrodinger.application.bb_database.bb_task_utils.zip_query(prefix: str) str

Given the prefix that was supplied to split_query(), this function creates the Zip archive <prefix>_queries.zip and adds <prefix>_catalog.json, along with all of its associated .bbq files, to the archive. Returns the name of the archive.
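
A rough sketch of this kind of archiving with the standard zipfile module is shown below. zip_query_files is a hypothetical stand-in: it assumes the associated query files can be found with the pattern <prefix>_*.bbq (based on the <prefix>_<host>.bbq naming mentioned for unzip_query) and that members are stored under their base names, neither of which is confirmed here.

```python
import glob
import os
import zipfile


def zip_query_files(prefix: str) -> str:
    """Archive <prefix>_catalog.json and matching .bbq files (illustrative)."""
    archive = f"{prefix}_queries.zip"
    # Assumption: query files follow the <prefix>_<host>.bbq naming pattern.
    query_files = sorted(glob.glob(f"{glob.escape(prefix)}_*.bbq"))
    members = [f"{prefix}_catalog.json"] + query_files
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in members:
            # Store each member under its base name so extraction is relocatable.
            zf.write(path, arcname=os.path.basename(path))
    return archive
```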