Solved

Determining the factors that limit the number of Cores used in Global Optimization

  • 10 August 2023
  • 21 replies
  • 223 views

Userlevel 3
Badge +4

Link:  Max Cores

At the risk of repeating the question above, now I’m being more specific:

  • How do I diagnose the reason that fewer than 100% of the cores are being used?

The model is:

  1. Sequential lens
  2. Merit function is the default minimum spot size with constraints on thickness / radii
  3. There are (x3) configurations for thermal

I am seeking Global Optimization solutions.

I have a “new” computer:

  • (x32) Cores
  • 128 GB RAM

When I started the optimization last night, the fan started spinning so I knew (?) it was working hard.  This morning, all is quiet.  (x7) of (x32) Cores are working on the problem and I have very little physical RAM in use as reported by the Windows Resource Monitor.

It looks like it cannot find a better design.

Referencing the question above:

  • The number of assigned variables during the optimization (x20 variables and fewer than x20 cores used)
    • OpticStudio will only use as many cores as the number of variables assigned
  • The amount of RAM in your system  (Way more memory available than cores)
    • For each variable assigned, OpticStudio loads a copy of the entire system into memory before tracing/evaluating.
      • Systems with imported CAD parts can commonly run out of RAM before utilizing all available cores.
    • To check the estimated memory required per copy of your system:
    • (I have way more memory than necessary; see the sizing sketch after this list.)

  • Last, some Analyses/ Merit Function Operands can only be executed single-threaded (All Analyses windows are closed except the Spot Diagram and Layout.  Which ones are single-threaded?)
    • This could result in intermittent reduction in core usage during an optimization cycle. 
  1. Closing Analysis windows did not increase the number of cores used
  2. Turning off Auto Update increases the number of cores used for Hammer Optimization
  3. Turning off Auto Update for Global Optimization still doesn’t increase the number of cores to (x20 = number of variables)
  4. Just started another Global Optimization.  (x19) variables.  Right at the start, all (x32) cores are being used
    1. Does this imply that the number of cores used is related to the progress of the search, or to how much of the optimization space remains to be explored?
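
To make the RAM factor quoted above concrete, here is a rough back-of-the-envelope sketch in plain Python. The "one copy per variable" and "copies must fit in RAM" rules come from the list above; the specific numbers (especially the per-copy size) are made-up assumptions you would replace with your own estimates.

```python
# Rough sizing sketch: how many parallel copies of the system could fit?
# All numbers below are illustrative assumptions, not OpticStudio internals.

n_variables = 20          # variables assigned in the optimization
n_physical_cores = 32     # cores reported by the machine
ram_available_gb = 100.0  # free RAM you are willing to dedicate
per_copy_gb = 0.5         # assumed footprint of one copy of the system
                          # (CAD-heavy systems can be far larger)

# Upper bound from memory: how many full copies fit in RAM?
copies_that_fit = int(ram_available_gb // per_copy_gb)

# The thread suggests: cores used <= min(variables, physical cores, copies that fit)
usable_cores_estimate = min(n_variables, n_physical_cores, copies_that_fit)

print(f"Copies that fit in RAM : {copies_that_fit}")
print(f"Estimated usable cores : {usable_cores_estimate}")
```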

Best answer by Kevin Scales 9 November 2023, 23:06


21 replies

Userlevel 7
Badge +3

I’m not sure exactly what the question is, but I would expect Global Search, Hammer and Tolerancing to all spawn as many threads as you have cores. However, if each thread executes really quickly, you can get the first thread returning before the (n+1)th thread can be launched. As a result, because threads return at about the same rate as they are launched, you might not get all CPUs utilized.
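
As a generic illustration of that launch/return bottleneck (plain Python multiprocessing, nothing OpticStudio-specific): when each task is very short, the serial dispatch loop dominates and extra workers add little, whereas long tasks amortize the overhead.

```python
import time
from multiprocessing import Pool

def tiny_task(_):
    """So short that dispatch/IPC overhead dominates the useful work."""
    return sum(range(100))

def long_task(_):
    """Long enough to amortize the dispatch/IPC overhead."""
    return sum(range(2_000_000))

def time_serial(task, n_jobs):
    start = time.perf_counter()
    for i in range(n_jobs):
        task(i)
    return time.perf_counter() - start

def time_parallel(task, n_jobs, workers):
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(task, range(n_jobs), chunksize=1)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Tiny jobs: the master's launch/collect loop is the bottleneck,
    # so 8 workers buy little or nothing (they sit idle between jobs).
    print("tiny, serial   :", time_serial(tiny_task, 5_000))
    print("tiny, 8 workers:", time_parallel(tiny_task, 5_000, 8))
    # Long jobs: the work dwarfs the dispatch cost, so workers stay busy.
    print("long, serial   :", time_serial(long_task, 32))
    print("long, 8 workers:", time_parallel(long_task, 32, 8))
```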

From your screenshot, that’s not what’s happening here. Is there maybe some interaction between the #tosave and #cores? It’s odd that #tosave is 10 and you’re only getting 10 cores running. That would be a bug if true.

Userlevel 3
Badge +4

I’m not sure exactly what the question is

Mark, the question is, when all the cores are not being used, how do I

  1. Detect this condition?
  2. Determine how to change the model / merit function such that all cores are being used

This assumes that more cores = better solution in less time.  I hope this is a good assumption!

From your screenshot, that’s not what’s happening here. Is there maybe some interaction between the #tosave and #cores? It’s odd that #tosave is 10 and you’re only getting 10 cores running. That would be a bug if true.

Exactly why I am asking for advice… the results do not make sense to me.  Sandrine’s post is not specific enough for me to debug this myself:  link.

Userlevel 7
Badge +3

In general, users can’t debug issues to do with threading. That’s between the application and the OS. By the time something’s working at the speed at which someone types into the UI, it’s single-threaded :-)

I can only suggest changing the number to save to see if the number of cores tracks. Either way, I think this is one for support, and there’s nothing a user can do to debug or fix.

Userlevel 3
Badge +4

Agreed… waiting for Zemax staff to respond here or I will send in a ticket.  Not sure how long is polite to wait before sending in a ticket.

I’m assuming that not too many other people check to see how many cores are working.  If they did, I’d love to hear about the experiences of other users.

-B

Userlevel 3
Badge +4

No response at this time.  Still trying to figure out:

How do I build my model and my merit function to maximize the use of threads/cores?

Userlevel 2

Have you tried changing the priority of the OpticStudio process to “Above normal” in the Task Manager?

Userlevel 3
Badge +4

Yes

Userlevel 4
Badge +1

There are a lot of factors that go into how many cores a given machine will use for a given system, and the number requested in the optimization run is only one of them. Very generally, OS will attempt one core per variable defined, but it will also consider how much memory is needed to make a complete copy of the system for each core to run on, and the number of cores actually allocated is determined by the CPU, not by OpticStudio. OS can only decide how many to request.

Brian, to answer your questions, this is not something the user has much control over. The calculations are very much under the hood, and the actual core count is not accessible within OS. There are things you can try in order to increase the core count, but (1) these are only attempts and may not have an effect, and (2) your assumption that more cores equals better performance is not accurate. This last point is the source of many questions and is worth recapping here.

Utilizing multiple cores to perform long series of more-or-less identical routines can certainly speed up a long process by parallelizing it, but there is also a lot of overhead involved. OS has to actually run the process, allocate the RAM for it, and coordinate the different cores as they finish their tasks at different times. Most of this under-the-hood calculation is outside what the optical designer can reasonably hope to predict in advance.

We have noted often that deliberately reducing the requested core count can immediately produce a decrease in run time because of the reduced overhead. It can also be the case that reducing the requested core count to a lower number that is still above what you’re getting can increase the actual number used. So, for example, if you have 32 available and OS is using 7, requesting 8 or 10 may result in getting a few extra. And, as I mentioned, that increase may or may not be beneficial to the run time. This is mainly trial and error. We do not have a general method to improve core allocation beyond what OS already tries to do, and a lot of it is outside the control of OS anyway. There’s nothing to be debugged in most cases, aside from tweaking the requested core count to a lower number and maybe getting improved performance.
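
The trial-and-error suggestion above amounts to timing the same job at a few different requested core counts. A minimal sketch of that bookkeeping follows; `run_optimization` is a hypothetical stand-in for however you actually launch the run (manually or via the ZOS-API), not a real OpticStudio function.

```python
import time

def run_optimization(requested_cores: int) -> None:
    """Hypothetical stand-in: launch your Global/Hammer run with the given
    requested core count and wait for it (or for a fixed test duration).
    Replace this body with your own launcher; it is NOT an OpticStudio call."""
    time.sleep(0.1)  # placeholder so the sketch runs as-is

def sweep(core_counts):
    """Time the same optimization at several requested core counts."""
    timings = {}
    for n in core_counts:
        start = time.perf_counter()
        run_optimization(n)
        timings[n] = time.perf_counter() - start
        print(f"requested {n:2d} cores -> {timings[n]:.1f} s")
    return timings

if __name__ == "__main__":
    # Per the advice above: try values at and just above what you actually see used.
    sweep([4, 8, 10, 16, 32])
```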

Userlevel 3
Badge +4

Thanks for the reply.  It reminds me of my studies in parallel computing in the 1990s during my PhD.  At that time, we were conceiving of multiple cores and writing specialized code to parallelize algorithms for use on these types of machines.

The answer you provided indicates to me that there is limited value in a multi-core processor on my PC.  If I can’t run it faster, why do I want more cores?  OpticStudio is really one of the few software applications that take appreciable time on my PC, so if I can’t get OpticStudio to run faster, then I shouldn’t pay for more cores.

Your answer also casts a shadow on using the Cloud for optics optimization.  I’ve had more than one customer say to me: 

Brian, can we just put this in the cloud and get a faster answer?

To date, I don’t have a licensing solution for that problem.  However, your answer indicates that even if it were affordable and possible to execute, I have little indication that an optimization will run significantly faster.

These observations are not academic for me.  They are quite practical. 

I need to decide:

  • Do I buy a new computer for $#### or go to a conference (or take a vacation!)?
  • Do I tell customers that we should advocate for optics in the cloud or just wait for it to happen?
Userlevel 7
Badge +3

Interesting questions Brian.

I can’t speak for where Ansys is going, obviously, but I do know a lot about how we got to where we are now. In the past, we never prioritized simply lighting up all the cores. The metric was always the total time taken to complete the task. Most often, and in the default case, more processors are better, especially when there are relatively few processors. But it isn’t always, for the reasons Kevin and I have given.

There is a good example of Amdahl’s law on this Wikipedia page: https://en.wikipedia.org/wiki/Amdahl's_law. Stealing a graphic from there, and quoting its caption:

The speed-up of a program from parallelization is limited by how much of the program can be parallelized.  For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20× as shown in the diagram, no matter how many processors are used. If 90% of the program can be parallelized, the theoretical maximum speed-up using parallel computing would be 10×.
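
For reference, the numbers in that caption follow directly from Amdahl’s formula, with p the parallelizable fraction and N the number of processors:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
% p = 0.95  ->  1 / 0.05 = 20x maximum speed-up
% p = 0.90  ->  1 / 0.10 = 10x maximum speed-up
```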

No way round this. It doesn’t matter if you’re using multiple processors or multiple machines, or what. Parallelization is not a magic bullet, but it can have good, positive benefits. BUT, getting 20x is a great improvement, even if you can’t get to 40x by just adding more processors.

I feel your pain about customers asking if the Cloud won’t just solve everything, or if putting it on a GPU card won’t just solve everything, etc. Sadly there are no simple answers. There is pure processor speed, the amount of memory each processor can access, the speed of that memory, the cache memory for each processor and so on.

We did actually offer a Cloud version of OS about five years ago, but it was not a commercial success except with people who were developing Cloud capability. OS is a great test bed for multi-core machines as it really can work all the cores, and not many codes can really do it as well as OS does it. When you look at the spec of a virtual machine on AWS or Azure, their ‘High Performance’ options are not as good as a medium grade engineering workstation. If you’re willing to pay ~$5k or more for a workstation, a physical machine on your desk is still the best way to go IMHO. The Cloud benefits are to do with team collaboration etc, not raw horsepower. 

For someone doing the weird and wonderful stuff I know you do, I think a workstation will be better than the Cloud for some time.

  • Mark
Userlevel 7
Badge +3

BTW, I would be curious to hear from anyone using FRED’s MPC/GPU module. Neil Barrett tells me it’s pretty good, but I’ve not heard from anyone using it as a daily driver.

Userlevel 3
Badge +4

@Mark.Nicholson 

Thanks for the sympathy.

Seems to me that Global Optimization and Local Optimization can easily be ported to multiple cores.  Let’s face it, we know that the optimization is a non-linear, multi-variable problem.  Thus, more sampling and more information is better.  To compute gradients or Jacobians, we need multiple computations.  Granted, we need memory for these, but that’s cheap.  So, from my perspective, a copy of the system on each core, with one core to rule them all, makes a lot of sense. 

This seems like a simple partitioning problem.  I could literally do this with the API if I could deploy the system images on multiple cores.  So, the real question is: why can’t I?  If the answer is that the O/S is not good enough, then it should be deployed on an O/S that is good enough; otherwise the technology will never advance.

Userlevel 7
Badge +3

Hey Brian,

I don’t think you’re quite getting my point. Imagine an NSC ray trace: you set up the trace dialog and press GO. A single, serial piece of code has to run to build each thread, copy the data, copy the instructions, and so on. That master thread itself cannot be parallelized, and it takes X time to run. Then it launches threads and receives threads back. It has to read the data back and launch other threads. While it’s launching one thread, another thread returns but has to wait until the master thread has finished launching the latest one before it can be read back. After that thread is launched, let's say two more threads return, so we now have three threads that need to be read back. And we can’t launch any more threads until those three have been read back in.

This is the crux of Amdahl’s law. As you increase threads, performance improves roughly linearly at first, then asymptotically approaches the maximum benefit given by Amdahl’s law. Once you get to about 80% of the maximum benefit you hit diminishing returns. That can still be a HUGE benefit, but you can’t just keep increasing the number of cores indefinitely. The fraction of serial code required ultimately limits the maximum gain.
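
To put numbers on that “about 80% of the maximum benefit” point, here is a small illustrative calculation assuming a 95% parallel fraction, as in the Wikipedia example above:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speed-up with parallelizable fraction p on n processors (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.95                        # assumed parallel fraction (the Wikipedia example)
ceiling = 1.0 / (1.0 - p)       # asymptotic maximum: 20x here
for n in (1, 2, 4, 8, 16, 32, 64, 128):
    s = amdahl_speedup(p, n)
    print(f"{n:4d} cores: {s:5.1f}x  ({100 * s / ceiling:4.0f}% of the 20x ceiling)")
# With p = 0.95, 32 cores gives ~12.5x (~63% of the ceiling), and you only pass
# ~80% of the ceiling somewhere between 64 and 128 cores: classic diminishing returns.
```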

So how fast the processor is, how fast the memory is, and the cache memory that feeds the processor ultimately limit the maximum gain available. 100 Pentium processors will still be slower than 100 Xeon processors, so the horsepower of the CPU matters, as it scales the time it takes to do the serial part of the code as well as the execution of the threads themselves.

So no, this is not a ‘simple partitioning problem’, nor could you do it with the API if only we provided feature X. Nor is it a function of the operating system. There is a fundamental limit to how much benefit parallelization can give that you need to get your head around. It’s described in the article on Amdahl’s law I gave.

  • Mark
Userlevel 3
Badge +4

So, I do get your point.

Now imagine you have enough memory (which I do) for each core to have a complete duplicate of the code and the model.  If that is the case, then as you work your way down the merit function, you can compute the Jacobian for each row in your merit function simultaneously.
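
A rough sketch of the proposal above: each worker evaluates the full merit function on its own perturbed copy of the system, yielding one finite-difference entry per variable. `evaluate_mf` is a hypothetical stand-in for evaluating your merit function on a private system copy (e.g. via the ZOS-API), not an OpticStudio call.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def evaluate_mf(x):
    """Hypothetical stand-in merit function for variable vector x.
    Replace with an evaluation on your own private copy of the system."""
    return float(np.sum((x - 1.0) ** 2))

def _entry(args):
    """One forward-difference entry: perturb one variable, re-evaluate the full MF."""
    x0, f0, i, h = args
    x = x0.copy()
    x[i] += h
    return (evaluate_mf(x) - f0) / h

def finite_difference_gradient(x0, h=1e-6, workers=None):
    """Finite-difference derivatives of the merit function, one variable per worker.
    (For a vector of operands rather than a scalar MF, each worker would return
    a full Jacobian column instead of a single number.)"""
    f0 = evaluate_mf(x0)
    jobs = [(x0, f0, i, h) for i in range(len(x0))]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return np.array(list(pool.map(_entry, jobs)))

if __name__ == "__main__":
    x0 = np.zeros(20)                       # 20 variables, as in the original post
    print(finite_difference_gradient(x0, workers=8))
```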

Userlevel 3
Badge +4

And we can’t launch any more threads until those three have been read back in.

Why?  What is the blocking operation?

Userlevel 3
Badge +4

This is the book that I used when I was in graduate school:  link.

So, yes I understand that shared memory, individual processor memory, and load sharing of processors all play a role in parallel processing.

One could ask: why bother having multiple cores if you can’t go any faster?  If OpticStudio cannot go faster, and its operations are eminently parallelizable, then what software can go faster with multiple cores?

Userlevel 3
Badge +4

I think what you are trying to express is that the L1, L2, and L3 caches are not large enough for the optical model.

I wonder if that is really true…

I’m not so interested in running an analysis faster.  I’m interested in optimizing faster, covering the optimization landscape more finely, getting to better solutions in less time.

Userlevel 7
Badge +3

No...not quite

We do already compute the merit function line by line. This works best in NS mode, as in sequential mode you have default merit functions that depend on the centroid, so there are groups of operands that have to be computed in series (think of RMS spot per field per wavelength...it’s not just one operand).
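
A toy illustration of that grouping (generic NumPy, not OpticStudio internals): the RMS-about-centroid value for one field/wavelength cannot be computed until the centroid of that same ray set is known, so each group is serial inside, even though different field/wavelength groups are independent of one another.

```python
import numpy as np

def rms_spot_about_centroid(ray_xy):
    """ray_xy: (n_rays, 2) landing coordinates for one field/wavelength.
    Step 1 (must come first): centroid of the traced rays.
    Step 2 (depends on step 1): RMS radius about that centroid."""
    centroid = ray_xy.mean(axis=0)                   # serial dependency
    r = np.linalg.norm(ray_xy - centroid, axis=1)
    return float(np.sqrt(np.mean(r ** 2)))

# Different field/wavelength groups are independent of each other, so the
# *groups* can run in parallel even though each group is serial internally.
rng = np.random.default_rng(0)
groups = [rng.normal(size=(500, 2)) for _ in range(6)]   # e.g. 3 fields x 2 wavelengths
print([round(rms_spot_about_centroid(g), 3) for g in groups])
```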

But even if each line could be evaluated independently of the others, it will never be simultaneous in the absolute sense that you’re using it. There must be serial code that launches and manages the parallel code. The performance improvement cannot be linear forever. That’s Amdahl’s Law.

By all means, write your own code and see. Ken put years into this, and Tim and his team have taken it even further. But you cannot make an infinitely fast computer by giving it an infinite number of processors. It just doesn’t work like that!

BTW, this is just one example of ‘Moore’s Law Doesn’t Work Anymore’. This is Gordon Moore’s observation that transistor counts (and, for a long time, effective CPU speeds) double roughly every 18-24 months. It was true for a long, long time but it has now flattened out. We’re now at a point in technology where CPU speeds in consumer products are being throttled by the speed of light itself, as data moves around the chip. It’s really a great time to be alive if you work with computers, even if no exponential lasts forever 😀

  • Mark
Userlevel 3
Badge +4

We’re now at a point in technology where CPU speeds in consumer products are being throttled by the speed of light itself, as data moves around the chip.

This was the basis of all the work in our research group for my dissertation.

so there are groups of operands that have to be computed in series (think of RMS spot per field per wavelength

I’m trying to express that the entire MF would be computed on a single core: each core computes the entire MF for a different optical system (e.g. different glass, radii, index).  In this way, you compute MFa and MFb, difference the results, difference the variables you have changed, and you have one entry in the Jacobian.

Why is this not possible?

Userlevel 3
Badge +4

I worked on the military version of this:  link.  HPCs have some of the technology to achieve that optical link between silicon.

Below: parallel optical fiber to chip (image).

Userlevel 3
Badge +4

While doing my regular work, I made the following observation:

  1. A Non-sequential model I ran with 4E8 rays fully utilized all of the cores on my machine
  2. Sequential model optimization does not use all the cores on my machine

This is precisely the question I have. 

Why does the sequential optimization not use all the cores on my machine?

What do I need to do to my MF or model to make it use all the cores?
