Subject: Re: [Gnucap-devel] Gnucap-devel Digest, Vol 88, Issue 3
Date: Tue, 04 Mar 2014 17:35:50 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
I have started work to parallelize (with OpenMP, and then maybe CUDA) the LU decomposition in gnucap.

Regarding CUDA licensing, I referred to the FAQ entry #SystemLibraryException, which points to section 1 of the GNU GPLv3. Based on that, I assume we could distribute the software with CUDA code in it, but without the actual library. I am not considering libraries built on top of CUDA here, i.e. cuBLAS. To me, CUDA looks like one of the easiest ways to work with GPGPUs, since it provides debugging and performance-analysis tools. CUDA would give good insight into what happens, allowing me to detect issues before re-factoring the code to use OpenCL, which is more compatible with the GPLv3 license.

OpenMP looks like a good way to do it. It looks simple, and comes with most compilers including gcc. Do not use CUDA. Unless I misunderstand, the licensing of CUDA makes it unsuitable for use in a GNU project.
Using our projects (a big system with low accuracy), a lot of time is spent inside lu_decomp(). I saw your topic about parallelism, and I have some time to work on it. Furthermore, we definitely need faster simulation results for our application. What kind of solution did you have in mind?

Very simple .. Identify certain loops that can run in parallel. That is really all. You should look at the output of the "status" command to see where the time is spent, which will show where parallelism could be of benefit and how much benefit to expect.
In the LU decomposition, running the outermost loop in parallel should be all that is needed there. But to get enough benefit, model evaluation should also be parallel, and that is likely more important.
I think that we cannot parallelize the outermost loop, because we noticed that most elements depend on previous computations (a block depends on all blocks between its lownode (bn) and its row value (mm)). To bypass this issue we thought about using a dependence tree. This tree would be created when we get the system, and updated each time the lownodes change, so that maintaining the dependence tree is not time consuming. Please refer to the PDF attached to this mail, which explains it with some figures. Please keep in mind this document is a draft for now.
- Is it the "section" you designed with row, diagonal, and column? In that case, did you want to use the fact that once all the sections between _lownode[mm] and mm are calculated, we can compute the element? We could then have a dependence graph (or tree) applied to your matrix storage sections, as is commonly used to parallelize the Gilbert-Peierls algorithm.

I don't think that makes sense here, but you might want to try it. Remember .. gnucap's matrix solver usually does low-rank updates and partial solutions. If you lose this feature it could make it so much slower that any parallel operation cannot come close to recovering the loss. The simpler solver used for AC analysis is not parallel ready. To parallelize the AC matrix solution it may be necessary to switch to the other lu_decomp, which requires double matrix storage.

- Is it an iterative method, with the problem that convergence could theoretically take an infinite number of operations (so maybe not a good way)?

No -- not iterative -- except for the standard DC/transient Newton iteration, which would not change.

- Is it to parallelize only the map (multiplication between elements) of the dot product, and then maybe parallelize the reduction (addition between elements)?

I think the overhead of parallelizing the dot product would be too high, thinking of the multi-thread model. The dot product might be a candidate for GPU-type processing, but look at "status" to judge whether there is enough potential benefit before doing this.
You are right; my tests show that the number of data transfers between the host (CPU) and the device (GPU) makes it unsuitable for our kind of application. It still seems like a good way for very large systems, though we need to do some testing to determine the matrix size at which this becomes interesting.