Software and hardware design example of a multi-CPU parallel computer system

Multi-CPU parallel computer technology greatly improves a system's computing speed, breaking through the limit of single-CPU processing speed. At the same time, a single-board design with multiple CPUs can reduce the size of the computer system as well as its development cost and development cycle.
I. Hardware design of the multiprocessor parallel computer system
A multiprocessor parallel computer system is a system model with a parallel structure. Each processor needs its own local memory to store its application programs and must be able to perform independent high-speed parallel computation. The system also needs a high-speed interconnection network that can quickly distribute parallel data blocks into each processor's local memory, improving the efficiency of the parallel system. The architecture can adopt a loosely coupled, asymmetric processor configuration interconnected by shared memory (dual-port RAM). The system structure is shown in Figure 1. Each processor in the figure has its own high-speed local memory and can perform independent parallel calculations at high speed. The processors are interconnected by dual-port memories, forming a high-speed star communication network. Because dual-port memories offer high communication speed and flexible ways to establish communication protocols, a loosely coupled multi-CPU parallel computer interconnected by dual-port memories has the following advantages:
(1) High communication bandwidth. The CPU can access the dual-port memory in byte, word, or double-word widths, and the data read/write speed is high.
(2) Simple structure. The processors connect directly to the dual-port memory, with no other interface circuits needed, so reliable bidirectional information transfer can be achieved.
(3) Tailorability. The number of processors can be increased or decreased as required.
(4) Strong extensibility. The system architecture can accommodate various processor families.
In the multiprocessor computer model shown in Figure 1, the CPUs can be Intel x86 series, PowerPC series, ARM series, etc. The Boot Processor (the master processor) is responsible for system management: it coordinates the work of each Application Processor (slave processor) and initializes the shared memory. To improve the system's power-on efficiency, each processor needs its own flash electronic disk to store programs, and each processor can have external devices (such as a network interface or keyboard).

II. Software design of the multiprocessor parallel computer
To improve processor execution efficiency, such computer systems generally adopt a real-time multitasking operating system. This paper discusses the software design of a multi-CPU parallel computer based on the embedded VxWorks operating system.
1. Shared memory network
In the VxWorks operating system, communication among multiple CPUs uses shared-memory backplane network technology, which manages the shared storage devices through a virtual network. The shared-memory network driver allows communication between multiple processors to take the form of a network, with usage conforming to the BSD 4.4-compatible model. The shared memory can reside on a CPU motherboard or on a separate memory board.
BP is the Boot Processor and AP is an Application Processor. The master processor serves the backplane subnet 200.200.200.0, through which the slave processors communicate with the external network. The master processor must have two network interfaces: one for communication with the external network (for example, with the VxWorks development host Vx-Host), with its IP address set to 90.0.0.10 as in Figure 2; the other for the virtual shared-memory network used to communicate with the slave processors. The slave processors' network IP addresses are configured as 200.200.200.1, 200.200.200.2, and 200.200.200.3. When debugging, the master processor first initializes the shared-memory network (including setting the memory address) and downloads its own VxWorks image from the development host. It then schedules each slave processor (AP) to download its required VxWorks image from the development host VxHost through the gateway address 90.0.0.10 and run the operating system. All debugging of the slave processors is done through the master processor.

The shared-memory network is a VxWorks module and must be selected in Tornado. Each processor has its own bootrom or VxWorks image, runs its own operating system independently, and communicates with the others through the shared memory.
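In a Tornado BSP, this module selection is usually made through configuration macros in the BSP's config.h. The fragment below is only a sketch: the macro set shown is typical of VxWorks 5.x, but the exact names available and the address/size values are BSP-dependent assumptions, not a drop-in configuration.

```c
/* Sketch of shared-memory network options in a BSP's config.h.
 * The address and size values are placeholders for illustration. */
#define INCLUDE_SM_NET                    /* pull in the shared-memory network */
#define SM_ANCHOR_ADRS  ((char *)0x600)   /* anchor address known to all CPUs  */
#define SM_MEM_ADRS     0x00800000        /* base of the shared memory region  */
#define SM_MEM_SIZE     0x01000000        /* 16 MB shared region               */
```

The anchor address must be agreed on by every processor at build time, since it is the one fixed location from which everything else in the shared region is found.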
2. Master device of shared memory network
In a multiprocessor system, one processor acts as the master device. The functions of the shared-memory network master are as follows:
(1) initializing the shared memory area and the shared-memory anchor;
(2) maintaining the heartbeat of the shared-memory network;
(3) acting as a gateway for the other processors to communicate with the external network;
(4) allocating the shared storage area.
In the VxWorks operating system, the shared storage area must be a contiguous address space; the default size is 16 MB, defined in the network driver. The master is responsible for allocating shared storage areas for the other processors and performing the memory mapping. The location of the shared storage area depends on the system configuration. All processors must be able to access this area through the shared-memory anchor, which is the communication reference point for all processors. The anchor and the shared memory area can be placed in dual-port RAM. The anchor contains the physical address offset of the actual storage area, set during master-device initialization. The anchor and the storage area must be in the same address space, and the address must be linear and valid.
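The offset-based addressing described above can be sketched in C. The structure and field names below are illustrative assumptions for demonstration, not the actual VxWorks anchor layout; the point is that the anchor stores an offset rather than an absolute pointer, so each CPU can map the anchor at a different local address and still find the shared region.

```c
#include <stdint.h>

/* Illustrative anchor layout (assumed, not the real VxWorks structure). */
typedef struct {
    uint32_t ready_value;  /* set by the master once initialization is done */
    uint32_t mem_offset;   /* offset of the shared region from the anchor   */
    uint32_t mem_size;     /* size of the shared region in bytes            */
} sm_anchor_t;

/* Each CPU resolves the shared region relative to its own local mapping
 * of the anchor, so only the offset has to be agreed on system-wide. */
static void *sm_region_local(sm_anchor_t *anchor)
{
    return (uint8_t *)anchor + anchor->mem_offset;
}
```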
After the master device of the shared-memory network has been initialized, all processors can use the shared-memory network. The master processor does not actually intervene in the packet exchange between the other processors over the network; communication between processors is carried out by local interrupts or polling. Once the shared memory is initialized, all processors, including the master, use the network on an equal footing. In the Tornado 2.0 environment, the master's processor number is specified as 0, and the system distinguishes the master processor from the slaves by processor number. Typically, the master processor has two Internet addresses, one for external network communication and one serving as the internal gateway.
3. Network heartbeat of shared storage area
In a multi-CPU system, the processors can communicate through the network only after the shared storage area has been initialized, so each processor needs to know whether the shared network is activated and ready. Heartbeat detection is used to let each processor learn the status of the network.
The heartbeat is a counter that the master processor increments once per second. The other processors monitor the heartbeat value to confirm whether the shared network is operating normally, usually sampling it every few seconds (depending on the situation). The heartbeat is located at the fifth 4-byte word of the shared-memory header, as shown in Figure 3.
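The monitoring scheme above can be sketched as follows. The byte offset of 16 (the fifth 4-byte word) is taken from the article's description of the header, not verified against VxWorks source, and the function names are illustrative.

```c
#include <stdint.h>

/* Per the text, the heartbeat occupies the fifth 4-byte word of the
 * shared-memory header, i.e. byte offset 16 (an assumed layout). */
#define SM_HEARTBEAT_OFFSET 16u

/* Read the counter; 'volatile' because another CPU is updating it. */
static uint32_t sm_heartbeat_read(volatile uint8_t *sm_header)
{
    return *(volatile uint32_t *)(sm_header + SM_HEARTBEAT_OFFSET);
}

/* A slave samples the counter a few seconds apart; since the master
 * increments it once per second, an unchanged value between samples
 * means the shared network has stalled. */
static int sm_network_alive(uint32_t prev_sample, uint32_t new_sample)
{
    return new_sample != prev_sample;
}
```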

4. Communication between processors
Communication between processors can use interrupts or polling. Each processor has an input queue for receiving data packets sent by other processors. In polling mode, the processor checks at fixed time intervals whether the queue has received data. In interrupt mode, the sending processor notifies the receiving processor that there is data in its input queue; the interrupt can be a bus interrupt or a mailbox interrupt. Interrupt mode is more efficient than polling.
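A minimal sketch of such a per-processor input queue is shown below, assuming a single sender and a single receiver per queue; the names and layout are illustrative, not the VxWorks driver's internal structures. The same `queue_poll` routine could be called either from a periodic polling task or from an interrupt handler.

```c
#include <stdint.h>

#define QUEUE_DEPTH 8

/* Ring buffer placed in shared memory: one slot index is owned by the
 * sender (tail) and one by the receiver (head), so no lock is needed
 * for the single-sender/single-receiver case. */
typedef struct {
    volatile uint32_t head;         /* written only by the receiver */
    volatile uint32_t tail;         /* written only by the sender   */
    uint32_t packets[QUEUE_DEPTH];  /* payload slots                */
} input_queue_t;

/* Sender side: enqueue a packet; returns 0 if the queue is full. */
static int queue_send(input_queue_t *q, uint32_t pkt)
{
    uint32_t next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head)
        return 0;  /* full: the receiver has not caught up */
    q->packets[q->tail] = pkt;
    q->tail = next;
    return 1;
}

/* Receiver side: dequeue one packet; returns 0 when no data arrived. */
static int queue_poll(input_queue_t *q, uint32_t *pkt)
{
    if (q->head == q->tail)
        return 0;  /* empty */
    *pkt = q->packets[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    return 1;
}
```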
A parallel system with multiple CPUs is similar to an embedded distributed system, and inter-processor communication can adopt distributed message queue and distributed database technology. Their combination provides a transparent communication platform for all processors in the system: a processor accesses a distributed message queue as if it were accessing its own local resources. Distributed message queue technology can simplify application design and speed up system development.
5. Resource allocation among multiple processors
In a single-board computer system with multiple processors, the most important consideration is the parallel execution efficiency of tasks. Multiple processors need to access peripheral devices and exchange data, so the problem of allocating external devices arises.
Equipment resources can be allocated in two ways. The first is customization (static allocation): resources are fixed when the single-board computer is designed. Its disadvantage is poor adaptability, since the resources cannot be changed to meet users' needs. The second is dynamic allocation: FPGA logic is loaded on the board and a software interface is reserved, so users can assign resources dynamically according to task requirements. Resource control is then transparent, with no need to know which CPU controls a given device. In the hardware design, arbitration and priority between CPUs and external devices must be considered to prevent conflicts when accessing critical resources. The software then specifies which CPU uses a particular device, and the other CPUs must be mutually excluded when accessing it.
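The mutual exclusion just described can be sketched with a test-and-set lock kept in shared memory, so that only one CPU drives a peripheral at a time. This is a sketch under the assumption that the hardware provides an atomic read-modify-write on the shared (dual-port) RAM; the type and function names are illustrative.

```c
#include <stdatomic.h>

/* A claim record per device, placed in shared memory. */
typedef struct {
    atomic_flag lock;  /* clear = device free, set = owned by some CPU */
    int owner_cpu;     /* informational: which CPU currently holds it  */
} device_claim_t;

/* Try to claim the device for 'cpu'; returns 0 if another CPU owns it.
 * atomic_flag_test_and_set() is the atomic read-modify-write that makes
 * the check-and-take a single indivisible step. */
static int device_try_claim(device_claim_t *d, int cpu)
{
    if (atomic_flag_test_and_set(&d->lock))
        return 0;
    d->owner_cpu = cpu;
    return 1;
}

static void device_release(device_claim_t *d)
{
    atomic_flag_clear(&d->lock);
}
```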
III. Performance of the multiprocessor parallel computer
In this system, the CPUs are Intel Pentium III processors with a 700 MHz clock. Test method: a data-processing algorithm was decomposed into modules, which were run on the system's processors. The test results show that, compared with a single CPU, using two CPUs improves computing performance by 60% to 70%, and with three CPUs the speedup reaches at least 2×. The biggest factor affecting this result is the test method: the algorithm is decomposed across multiple processors, and the decomposition method directly determines the overall processing efficiency. What is certain is that the parallel design with multiple processors can greatly improve the computing efficiency of the system.
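The measured speedups are consistent with Amdahl's law, which bounds the speedup by the fraction of the work that actually parallelizes. The sketch below illustrates this; the parallel fraction p = 0.79 used in the usage note is an assumed figure chosen to roughly match the reported results, not a measured one.

```c
/* Amdahl's law: the speedup with n processors when a fraction p of the
 * work parallelizes perfectly is 1 / ((1 - p) + p / n). */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With an assumed p = 0.79, this gives about 1.65× for two CPUs (matching the reported 60% to 70% gain) and about 2.1× for three CPUs, consistent with the "at least 2×" measurement; the diminishing return per added CPU is exactly the effect the test method's decomposition quality governs.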
