Spark

toksearch.backend.spark.ToksearchSparkConfig dataclass

Configuration for the Spark backend

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sc | Optional[SparkContext] | SparkContext to use. If not provided, a default SparkContext will be created. | None |
| numparts | Optional[int] | Number of partitions to use. If not provided, the number of records will be used. | None |
| cache | bool | Whether to cache the RDD. | False |
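The defaults described above can be illustrated with a small sketch. It does not import toksearch or pyspark: `SparkConfigSketch` and `resolve_numparts` are hypothetical stand-ins that only mirror the documented fields and the documented `numparts` fallback, and they are not the library's implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class SparkConfigSketch:
    """Hypothetical stand-in mirroring the documented fields (not the toksearch class)."""

    sc: Optional[Any] = None        # documented type: Optional[SparkContext]
    numparts: Optional[int] = None  # None means "use the number of records"
    cache: bool = False             # whether to cache the RDD


def resolve_numparts(config: SparkConfigSketch, records: List[Any]) -> int:
    """Return the partition count implied by the documented default (assumed helper)."""
    if config.numparts is not None:
        return config.numparts
    return len(records)


records = ["rec_a", "rec_b", "rec_c"]  # placeholder records, for illustration only
default_parts = resolve_numparts(SparkConfigSketch(), records)
explicit_parts = resolve_numparts(SparkConfigSketch(numparts=2), records)
```

With no `numparts` given, the sketch falls back to one partition per record. Passing an explicit value overrides that.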

toksearch.backend.spark.SparkRecordSet

Bases: RecordSet

do_cache = cache instance-attribute

rdd = rdd instance-attribute

__getitem__(index)

__init__(rdd, cache=False)

__iter__()

__len__()

cache()

cleanup(immediate=False)

Shut down the SparkContext.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| immediate | | Whether to shut down the SparkContext immediately. If false (default), the data will be collected before shutting down the SparkContext and remain accessible after the SparkContext is stopped. | False |
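The difference between the two `immediate` settings can be sketched without Spark. Everything below is a hypothetical stand-in (`FakeContext`, `RecordSetSketch`, and the `fetch` callable are assumptions made for illustration). It only models the documented behaviour: data is collected before shutdown unless `immediate` is true.

```python
from typing import Any, Callable, List, Optional


class FakeContext:
    """Hypothetical stand-in for a SparkContext; it only tracks whether it was stopped."""

    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


class RecordSetSketch:
    """Toy record set whose data can only be fetched while the context is alive."""

    def __init__(self, context: FakeContext, fetch: Callable[[], List[Any]]) -> None:
        self._context = context
        self._fetch = fetch
        self._local: Optional[List[Any]] = None  # filled once data is collected

    def collect(self) -> List[Any]:
        if self._local is None:
            if self._context.stopped:
                raise RuntimeError("context stopped before data was collected")
            self._local = self._fetch()
        return self._local

    def cleanup(self, immediate: bool = False) -> None:
        if not immediate:
            self.collect()  # pull the data locally before shutting down
        self._context.stop()


# Default cleanup: the data stays accessible after the context is stopped.
kept = RecordSetSketch(FakeContext(), lambda: [1, 2, 3])
kept.cleanup()
kept_data = kept.collect()

# Immediate cleanup: nothing was collected, so later access fails in this sketch.
dropped = RecordSetSketch(FakeContext(), lambda: [1, 2, 3])
dropped.cleanup(immediate=True)
try:
    dropped.collect()
    immediate_failed = False
except RuntimeError:
    immediate_failed = True
```

In this model, the default trades a collection step at shutdown for continued access to the results afterwards.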

from_records(records, config=None) classmethod

Create a SparkRecordSet from a list of records.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| records | List[Record] | List of records to create the RecordSet from. | required |
| config | Optional[ToksearchSparkConfig] | Configuration for the Spark backend. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| SparkRecordSet | SparkRecordSet | The record set |

map(*operations)

to_rdd(**kwargs)