What is PFA?
PFA is not: | PFA is: |
---|---|
A data pipeline framework, like Hadoop or Spark. | A specification for data analytics, the math that gets computed in Hadoop or Spark (or other frameworks). |
A math library, like Apache Commons Math or Numpy. | A specification for expressing mathematical algorithms, which an implementation like Commons Math or Numpy can perform in response to a PFA-formatted request. |
A graphical math formatting language, like LaTeX or MathML. | PFA is unconcerned with presenting math visually, only computing it. |
A general-purpose programming language, like Java or Python. | A file format for expressing algorithms that might have been written in Java or Python and then converted to PFA. Whereas human-oriented languages like Java or Python have a complex syntax and provide almost total control over a computer, including access to operating system calls, the file system, network ports, etc., PFA's syntax is a subset of JSON and it only provides the ability to do math. The external system must feed data into PFA and interpret the results that come out. PFA provides separation between control system code (written in a general-purpose programming language) and mathematical algorithms, so that the mathematical algorithms can be tested and versioned more rapidly than the whole system. |
Virtual bytecode, like LLVM or JVM executables. | While PFA is intended as an intermediate format for moving executable algorithms from one language or environment to another, it operates at a much higher level than bytecode. The names in JSON are human-readable and describe operations such as finding nearest clusters, walking through decision trees, and manipulating matrices. |
A data mining/machine learning toolkit, like R, Scikit-Learn, MLlib, Mahout, Breeze... | Although PFA can be used to build models from training data ("fit" in R), its primary purpose is to apply fitted models ("predict" in R) in a formal or large-scale production. Typically, one would use a data mining/machine learning toolkit to train a model, then convert it to PFA to get it into production. The production environment is usually more restricted than the development environment, since it is capable of performing business actions and must therefore be secure. Some of these models might update internal state while performing their calculations, which is not completely distinct from training, so PFA has this ability and defines rules for distributed model updates. |
PMML, the Predictive Model Markup Language | PFA's purpose is similar to PMML's, namely to provide an interchange format for data mining models (and other mathematical algorithms), but there are several structural differences.
|
Source