Parallel computing solves a computational problem by breaking it apart into discrete subtasks and distributing them among multiple servers on which they will be implemented concurrently. Each subtask returns its own computational result to the master task for combination. Through the simultaneous use of multiple compute resources, parallel computing can effectively enhance the computational speed and the data processing ability.
There are two forms of parallel computing – the multithreading executed on a single computer and the parallel system established through a computer cluster consisting of a network of multiple stand-alone computers. Nowadays when we talk about parallel computing, usually it is the latter that we refer to.
This article discusses the configuration of parallel servers in esProc and their application, giving you more basic ideas about the parallel computing.
1. esProc parallel server
In a parallel system, a server, or a process, that gives instructions is the main node. Other servers that receive the instructions are sub-nodes. During computation each server in this system uses an independent process to execute program instructions. Parallel servers in operation receive computational instructions, execute local cellset files accordingly and return results to the main node. In a computer cluster, a server can be run on multiple computers and a computer can have several servers running; each server permits multiple threads to execute its computational subtask. All running servers constitute the parallel computing server system.
esProc provides server class – com.raqsoft.parallel.UnitServer – to get addresses and ports according to configuration files and to launch the parallel servers. A computer is allowed to run multiple UnitServer processes.
esProc parallel system hasn’t a certain single “manager” node centrally conducting a group of “worker” nodes. An available computer cluster is designated to serve as the provisional nodes for execution of each parallel task.
Each parallel task has its logical center – the main node – from which instructions are given and to which results are returned for being combined. The task fails if the main node malfunctions; but if one of the sub-nodes malfunctions, the main node will redistribute the subtask to another one that is available. To know more about the execution of parallel tasks in esProc, see esProc Parallel Computing: Cluster Computing.
The data file needed by a sub-node to perform its own task should be stored in this sub-node beforehand. We can store the data in multiple sub-nodes, which is called redundant storage, so that all of them have the potential to perform the same subtask. The main node selects an idle sub-node to which the subtask will be distributed. Besides, redundant storage guarantees that, when a sub-node breaks down due to disasters caused by system failure, memory over flow and network anomaly, etc., there are able sub-nodes to one of which the main node can redistribute the subtask. Therefore the use of redundant storage ensures both fault-tolerance and the steadiness in performance. About the design of redundancy, the article esProc Parallel Computing: Data Redundancy will cover more details.
Data can also be stored on a Network File System (NFS), such as HDFS, over which they can be accessed by nodes. The NFS management is simpler than storing data on nodes for fault-tolerance. But compared with accessing files stored locally, it may sacrifice some performance due to the network transmission.
2. The launch
Run the esprocs.exe file under esProc installation directory’s esProc\bin path to launch parallel servers. The jars needed by the file will be automatically loaded under the installation directory. Here one thing to note is that configuration files – config.xml and unit.xml – and log configuration properties file must be under esProc installation directory’s esProc\config path.The setting of configuration files will be introduced in the next section.
Click on Start on UnitServer window to run parallel servers. Run-time information will be written to the console according to the log configuration.
At the launch of UnitServer, we also need to specify the server’s main path through the parameter start.home and ensure that the parallel server configuration files exist in the config directory under this main path. For example:
On the above command prompt, parameter java.ext.dirs is set to specify the loading path for Java’s jar and parameter start.home to specify the main path of parallel servers. Other parameters can be set too, for instance user.language specifies which language is to be used. The multiple server threads initiated in a single computer can either share the same main path and configuration files or have their respective main path and configuration files specified for them.
Under Linux, run startunit.sh to launch the parallel service class:
3. Configuration files
As mentioned in the above section, three configuration files are required to run parallel servers. They are config.xml, unit.xml and the log configuration file.
config.xml file comes first. The file contains esProc’s basic configuration information, like registration code, searching path, main path and data source configuration, etc. It resides on the esProc\config path of esProc’s installation directory. Click Tool>Options on the menu bar to configure config.xml on the Environment page, as shown in the following figure:
For the parallel computing, a dfx file can be called directly by its name, rather than by the full path, wherever it exists – the main path or any directory of the searching path.
For the data source configuration, click Tool>Datasource connection on the menu bar, then view or modify the data sources in the data source manager. To know details about the configuration, see esProc and Databases: Database Configuration.
Then comes another configuration file for parallel servers – unit.xml.
Names of the two configuration files must not be changed and the files should be placed under the config directory of home specified at the launch of servers.
The following is the configuration information of unit.xml:
<?xml version=”1.0″ encoding=”UTF-8″?>
<Server>
<TempTimeOut>600</TempTimeOut>
<Interval>6</Interval>
<ProxyTimeOut>30</ProxyTimeOut>
<UnitList loggerProperties=”raqsoftLog.properties”>
<Unit host=”192.168.0.86″ port=”4001″ nodeTask=”4″ callxTask=”2″/>
<Unit host=”192.168.0.86″ port=”4004″ nodeTask=”4″ callxTask=”4″/>
<Unit host=”192.168.1.84″ port=”4001″ nodeTask=”4″ callxTask=”1″/>
<Unit host=”192.168.0.86″ port=”4006″ nodeTask=”4″ callxTask=”2″/>
</UnitList>
</Server>
The sub-node list UnitList includes all local machine’s IP addresses and ports that can potentially be used to run the servers. Servers will automatically search the list for the vacant IP address and ports at the launch. Note that the IP addresses should be the local machine’s real IP addresses. You can set multiple IP addresses when using multiple network adapters. TempTimeOut specifies the length of the life cycle of a temporary file (measured by seconds); interval is the time interval between two expiration checks, its value must be positive; proxyTimeout is the proxy life cycle, i.e. the life cycle of remote cursor and task space (measured by seconds). If either TempTimeOut or proxyTimeout is not given a value, or if either of them is 0, do not perform the expiration check.
In the allocation of sub-nodes, host is the server’s IP address, port is the specified port number and nodeTask is the number of computers running in parallel on which a server allows to be run. callxTask is the maximum number of parallel subtasks each callx is allowed to run on a sub-node and typically should be not greater than the number of CPU cores a physical machine has. This number of parallel subtasks is usually less than the number of computers running in parallel.
loggerProperties is the server’s log configuration properties file, in which log levels and other information can be set. raqsoftLog.properties – the log configuration properties file is as follows:
// Log levels include OFF, ERROR, WARN, INFO, DEBUG and ALL, whose priority levels decrease from left to right. Level OFF is intended to turn off logging
// Log messages. Level INFO indicates that messages of levels ERROR, WARN and INFO will be exported; each level exports messages likewise
// Specify name and log level for Logger
// Format: Level (can be omitted; default level is INFO), log name 1, log name 2
Logger=LOG1
// Output log messages to the console. There are only two forms of message logging: console and file
LOG1=Console
// Log message level; messages whose levels are lower than it in priority will be ignored. If it is omitted, level INFO is used by default
LOG1.Level=DEBUG
// Output logs to the specified file
// Specify the full path of LOG2. Default full path is the application’s current working path
LOG2=C:/raqsoft.log
// Default mode of logging is appending if not specified
LOG2.Append=true
// Maximum number of bytes in a log file. The default value is infinite if not specified
LOG2.MaxFileSize=10MB
// Maximum number of backup files. The default value is 1 if not specified
LOG2.MaxBackupIndex=2
//LOG2.Level=DEBUG
4. Data storage
esProc provides storage service to manage parallel servers in a cluster and the subtasks run on every server.
Under esProc installation directory’s esProc\bin path, click on datastore.exe file (datastore.sh under Linux) to open a window as follows:
On the Data Store interface, click node search button to search the current cluster for servers and list each of them. Select an operating server and a list of subtasks run on it will be displayed on the right part of the interface. You are allowed to force a termination of a certain subtask. When the subtask is aborted, the operation performed on the main node stops too.
5. Application
callx instruction is used in a cellset to distribute subtasks among running servers. parallel01.dfx is such a cellset:
A | |
1 | =file(“D:/files/txt/PersonnelInfo.txt”) |
2 | =A1.import@tz(;,pPart:pAll) |
3 | =A2.select(State==pState) |
4 | return A3 |
This piece of program imports a segment of data from the personnel information file PersonnelInfo.txt and selects from them data of employees from the specified state. The cellset parameters used in it are as follows:
The master program invokes parallel.dfx through cluster computing to find out all employees from Ohio:
A | |
1 | [192.168.0.86:4001,192.168.0.86:4004,192.168.1.84:4001] |
2 | =callx(“D:/files/dfx/parallel01.dfx”,”OH”,to(20),20;A1) |
3 | =A2.conj() |
A1 specifies a list of parallel servers for computing. A2 invokes these servers to execute the parallel computing. When executed, A2’s result is as follows:
It can be seen that each parallel server returns a result set, thus A2’s result is a sequence composed of multiple record sequences. A3 joins all results together:
Through this form of parallel computing, master program divides a complicated computational task or an operation involving big data into multiple subtasks, distributes them to multiple servers to compute separately and then joins every result together. The topic of parallel computing will continue in the article esProc Parallel Computing: Cluster Computing.