Scripting with Python for Spark
SPSS Modeler can run Python scripts that use the Apache Spark framework to process data. This documentation describes the Python API for the interfaces provided.
The SPSS Modeler installation includes a Spark distribution.
Accessing data
A script obtains its input from the Analytic Server context as a Spark SQL DataFrame:

inputData = asContext.getSparkInputData()

and returns its output to the context in the same form:

asContext.setSparkOutputData(outputData)

Output built as an RDD can be converted to a data frame through the SQL context:

outputData = sqlContext.createDataFrame(rdd)
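Taken together, these calls give the basic shape of a Python/Spark script: read the input, transform it, and hand the result back. The sketch below illustrates that call sequence with a minimal stand-in context; the real asContext object is supplied by the SPSS Modeler runtime, and the data here is a plain list of tuples rather than a Spark DataFrame.

```python
# Stand-in for the Analytic Server context, for illustration only.
# In SPSS Modeler the real asContext is provided by the runtime.
class StubContext:
    def __init__(self, rows):
        self._input = rows
        self._output = None

    def getSparkInputData(self):
        return self._input

    def setSparkOutputData(self, rows):
        self._output = rows

asContext = StubContext([("Alice", 34), ("Bob", 29)])

# Typical script body: read input, transform each row, write output.
inputData = asContext.getSparkInputData()
outputData = [(name.upper(), age) for name, age in inputData]
asContext.setSparkOutputData(outputData)
```

In a real node the transform step would use DataFrame operations, but the read/transform/write shape is the same.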
Defining the data model
A node that produces data must also define a data model that describes the fields visible downstream of the node. In Spark SQL terminology, the data model is the schema.
A Python/Spark script defines its output data model in the form of a pyspark.sql.types.StructType object. A StructType describes a row in the output data frame and is constructed from a list of StructField objects. Each StructField describes a single field in the output data model.
You can obtain the data model of the input data from the schema attribute of the input data frame:

inputSchema = inputData.schema

Individual fields are constructed using the StructField constructor:

field = StructField(name, dataType, nullable=True, metadata=None)
Refer to your Spark documentation for information about the constructor.
You must provide at least the field name and its data type. Optionally, you can specify metadata to provide a measure, role, and description for the field (see Data metadata).
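As a sketch of what that optional metadata can look like, the dictionary below carries a measure, role, and description for a field. The key names used here are illustrative assumptions; the exact keys SPSS Modeler recognizes are listed under Data metadata.

```python
# Illustrative only: a metadata dictionary supplying measure, role, and
# description for a field. The exact key names recognized by SPSS Modeler
# are documented under "Data metadata"; these are assumptions.
fieldMetadata = {
    "measure": "continuous",              # assumed key/value, for illustration
    "role": "input",                      # assumed key/value, for illustration
    "description": "Customer age in years",
}

# With pyspark available, this dictionary would be passed to the constructor:
#   field = StructField("age", DoubleType(), nullable=True, metadata=fieldMetadata)
```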
DataModelOnly mode
SPSS Modeler needs to know the output data model for a node before the node runs, to enable downstream editing. To obtain the output data model for a Python/Spark node, SPSS Modeler runs the script in a special data model only mode in which no data is available. The script can identify this mode using the isComputeDataModelOnly method on the Analytic Server context object.
if asContext.isComputeDataModelOnly():
    inputSchema = asContext.getSparkInputSchema()
    outputSchema = ... # construct the output data model
    asContext.setSparkOutputSchema(outputSchema)
else:
    inputData = asContext.getSparkInputData()
    outputData = ... # construct the output data frame
    asContext.setSparkOutputData(outputData)
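A common pattern for the data model only branch is "input schema plus derived fields". The sketch below shows that shape with schemas represented as plain lists of (name, type) pairs instead of a real StructType, and a stand-in context, since the actual objects are supplied by the SPSS Modeler runtime.

```python
# Stand-in context for illustration; the real asContext comes from the runtime,
# and real schemas are pyspark.sql.types.StructType objects.
class StubContext:
    def __init__(self, schema):
        self._input_schema = schema
        self._output_schema = None

    def isComputeDataModelOnly(self):
        return True

    def getSparkInputSchema(self):
        return self._input_schema

    def setSparkOutputSchema(self, schema):
        self._output_schema = schema

asContext = StubContext([("name", "string"), ("age", "integer")])

if asContext.isComputeDataModelOnly():
    inputSchema = asContext.getSparkInputSchema()
    # Output model: every input field, plus one derived field.
    outputSchema = inputSchema + [("age_squared", "integer")]
    asContext.setSparkOutputSchema(outputSchema)
```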
Building a model
A node that builds a model must return to the execution context some content that describes the model sufficiently that the node which applies the model can recreate it exactly at a later time.
Model content is defined in terms of key/value pairs; the meaning of the keys and values is known only to the build and score nodes and is not interpreted by SPSS Modeler in any way. Optionally, the node may assign a MIME type to a value so that SPSS Modeler can display values with known types to the user in the model nugget.
A build node stores string content under a key using:

asContext.setModelContentFromString(key, value, mimeType=None)

and the corresponding score node retrieves it with:

value = asContext.getModelContentToString(key)

Content generated as a file or folder bundle can be stored from a path:

asContext.setModelContentFromPath(key, path)

Note that in this case there is no option to specify a MIME type because the bundle may contain various content types. A temporary folder for writing such content can be created with:

path = asContext.createTemporaryFolder()

and stored content can be retrieved back to a temporary location:

path = asContext.getModelContentToPath(key)
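The string-valued calls form a simple key/value round trip: the build node serializes whatever describes the model and stores it by key, and the score node retrieves it by the same key and recreates the model. The sketch below models that round trip with a dictionary-backed stand-in context; the real implementation persists the content with the model nugget.

```python
import json

# Stand-in context for illustration; the real asContext persists model
# content with the model nugget rather than in a dictionary.
class StubContext:
    def __init__(self):
        self._content = {}

    def setModelContentFromString(self, key, value, mimeType=None):
        self._content[key] = value

    def getModelContentToString(self, key):
        return self._content[key]

asContext = StubContext()

# Build node: serialize whatever describes the model and store it by key.
model = {"weights": [0.5, 1.5], "intercept": 0.1}
asContext.setModelContentFromString("model", json.dumps(model),
                                    mimeType="application/json")

# Score node: retrieve the content by the same key and recreate the model.
restored = json.loads(asContext.getModelContentToString("model"))
```

Only the build and score nodes need to agree on the key and the serialization format; SPSS Modeler treats the value as opaque.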
Error handling
A script can report an error condition to the user by raising an exception from the spss.pyspark.exceptions module. For example:

from spss.pyspark.exceptions import ASContextException

if ... some error condition ...:
    raise ASContextException("message to display to user")