If you are writing a program in FORTRAN, we recommend adding implicit none, especially when there is a lot of code: it lets the compiler catch many problems at compile time.
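As a minimal illustration (hypothetical code, not from the program discussed below), implicit none turns a typo into a compile-time error instead of a silent implicitly-typed variable:

```fortran
      program demo
      implicit none
      integer count
      count = 5
c     A typo such as "cont = count + 1" would now be rejected by the
c     compiler as an undeclared variable; without implicit none, "cont"
c     would silently become a new, implicitly-typed REAL variable.
      print *, count
      end
```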
1.
- [root@c0108 parallel]# mpiexec -n 5 ./simple
- Aborting job:
- Fatal error in MPI_Irecv: Invalid rank, error stack:
- MPI_Irecv(143): MPI_Irecv(buf=0x25dab60, count=0, MPI_DOUBLE_PRECISION, src=5, tag=99, MPI_COMM_WORLD, request=0x7fffa02ca86c) failed
- MPI_Irecv(95): Invalid rank has value 5 but must be nonnegative and less than 5
- rank 4 in job 5 c0108_52041 caused collective abort of all ranks
- exit status of rank 4: return code 13
The above indicates that rank 5 is invalid: when mpiexec -n 5 ./simple runs, five processes are started (ranks 0 1 2 3 4), so the out-of-range rank must come from the code itself. It is not necessarily a hard-coded rank number; it may also be a parameter that was not passed correctly. MPI produces many inexplicable errors like this...
My code contains only a limited number of mpi_irecv statements, so I debugged by adding print statements to locate the failing line, as shown below:
      print *, myid+1, '20140901'   ! debug marker
      call mpi_irecv(P(1,1,location), IMAX*JMAX*min(ITSP, KE-MYKE),
     &     mpi_double_precision, myid+1, rely, mpi_comm_world, req, ierr)
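One common way to make such a neighbor exchange safe at the boundary (a sketch under the assumption that myid+1 can run past the last rank; numprocs would come from mpi_comm_size) is to map the out-of-range neighbor to mpi_proc_null, which MPI treats as a no-op source:

```fortran
c     Sketch: clamp the neighbor rank before the nonblocking receive.
c     A receive from MPI_PROC_NULL completes immediately and leaves
c     the buffer untouched, so no extra boundary branch is needed.
      integer src
      src = myid + 1
      if (src .ge. numprocs) src = mpi_proc_null
      call mpi_irecv(P(1,1,location), IMAX*JMAX*min(ITSP, KE-MYKE),
     &     mpi_double_precision, src, rely, mpi_comm_world, req, ierr)
```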
2.
- [root@c0109 test]# mpiexec -n 5 ./simple
- rank 3 in job 22 c0109_000064 caused collective abort of all ranks
- exit status of rank 3: killed by signal 11
- [root@c0109 test]#
Signal 11, officially known as a "segmentation fault", means that the program accessed a memory location that was not assigned to it. That is usually a bug in the program.
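In MPI programs a typical trigger (an illustrative sketch, not reconstructed from the log above) is a receive count larger than the buffer it writes into:

```fortran
c     Sketch: a count/buffer mismatch that commonly ends in signal 11.
      double precision buf(100)
      integer status(mpi_status_size), ierr
c     Asking MPI to deposit 200 elements into a 100-element array
c     writes past the end of buf -- undefined behavior, often a
c     segmentation fault on the receiving rank:
      call mpi_recv(buf, 200, mpi_double_precision, 0, 99,
     &     mpi_comm_world, status, ierr)
```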
3.
- [root@c0108 test]# mpirun -np 4 ./simple
- Aborting job:
- Fatal error in MPI_Wait: Invalid MPI_Request, error stack:
- MPI_Wait(139): MPI_Wait(request=0x7fff1f675228, status=0x7fff1f675218) failed
- MPI_Wait(75): Invalid MPI_Request
- rank 2 in job 24 c0108_52041 caused collective abort of all ranks
- exit status of rank 2: return code 13
Solution:
Generally it is because MPI_Test or MPI_Wait is supplied a request that is unknown to MPICH (the request was not the one returned by MPICH when you made the isend/irecv/send_init/recv_init). In other words, mpi_irecv does not match mpi_wait(req, status, ierr), and the request handle is wrong. If there are many mpi_wait() calls, you can comment them out one by one to isolate the error... In addition, if it is a FORTRAN program, first check the status variable declaration:
integer req, status(mpi_status_size), ierr
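For reference, a minimal correctly matched pair (a sketch; the buffer size, source, and tag are arbitrary):

```fortran
c     Sketch: the request returned by mpi_irecv must be the same one
c     handed to mpi_wait, and status must be an INTEGER array of
c     size MPI_STATUS_SIZE (not a scalar).
      include 'mpif.h'
      integer req, status(mpi_status_size), ierr
      double precision buf(100)
      call mpi_irecv(buf, 100, mpi_double_precision, 0, 99,
     &     mpi_comm_world, req, ierr)
c     ... overlap computation here ...
      call mpi_wait(req, status, ierr)
```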
4.
Aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(195): Initialization failed MPID_Init(170): failure during portals initialization (321): progress_init failed (653): Out of memory
There is not enough memory on the nodes for the program plus the MPI buffers to fit. You can decrease the amount of memory that MPI uses for buffers with the MPICH_UNEX_BUFFER_SIZE environment variable.
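For example (a sketch; MPICH_UNEX_BUFFER_SIZE is the Cray MPICH name for this variable, and the value here is arbitrary):

```shell
# Shrink the buffer MPI reserves for unexpected messages, then launch.
# 41943040 bytes = 40 MB; tune this to your node's memory budget.
export MPICH_UNEX_BUFFER_SIZE=41943040
mpiexec -n 5 ./simple
```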
Comments and corrections are welcome!