Troubleshooting Bad WUs

Posted: Sun Oct 31, 2010 6:40 pm
by PantherX
While folding on your CPU/GPU, you may have encountered a Work Unit (WU) that has given an error. When that happens, the event falls into one of two categories: Problematic Hardware or Bad WU. It is almost impossible to tell the difference unless you know what you're looking for. The primary purpose of this topic is to help you differentiate between the two categories by eliminating Problematic Hardware, leaving only Bad WUs to be reported.

If you're new to the F@H Project, have a read of this topic for a quick overview: Welcome To The F@H Support Forum


1 - Problematic Hardware
Sometimes a WU assigned to your F@H Client is a good one, but a specific hardware problem on your system prevents it from being folded correctly, so it returns an error. This issue has to be fixed on the Donor's system for folding to continue without errors (more details below).


2 - Bad WU
Sometimes a WU assigned to your F@H Client is simply a bad one, which is unavoidable given the statistical nature of folding. The F@H Team tries to keep the error rate very low, for example:
  • <1% for a Project in full release (i.e. full F@H)
  • Between 1% and 2% (inclusive) for advanced (pre-release testing by Public Donors)
  • 2% or higher for earlier testing (Beta testers and internal testers)
Despite this, you may still occasionally get a WU that fails. Unfortunately, there's no way to tell which WUs are bad until they are folded. This issue might be fixed by the F@H Team (more details below).

Note: The use of <client-type v='advanced'/> increases your probability of getting a bad WU since you will have access to pre-release WUs.

1 - Problematic Hardware

Posted: Sat Apr 18, 2020 3:11 am
by PantherX
  • Problematic Hardware
Due to the diversity of hardware, it is impractical to test the WUs on every setup. Before the F@H Team releases a Project to the public, they follow a certain protocol to ensure that their Project(s) run on as many systems as possible. Below is a brief summary of their protocol (it assumes that the Project passes each stage):
  • Prepare Simulations: Project undergoes simulation preparation which is done by researchers.
  • Internal Testing: Project undergoes internal testing to ensure that the new Project meets their standards on designated test hardware.
  • Beta Testing: Project is made available to Beta Testers (details) who ensure that the WUs can be processed without any errors on a range of available hardware.
  • Advanced Testing: Project is then made available to those F@H Donors that are using <client-type v='advanced'/>. The Project is then monitored for any anomalies caused by the diverse range of hardware folding it.
  • Full F@H: Project is then made available to all F@H Donors. Rare error reports may still be encountered occasionally.
Stock / Overclocked Systems
Running the F@H Software on a stock system is very simple and there aren't many issues with these systems. However, if you're running the F@H Software on an overclocked system, there's a chance that your overclock is not stable. Officially, F@H recommends the stock clock frequencies set by the vendor (AMD, Intel, Nvidia). This Forum doesn't assist Donors with overclocking their systems. Dedicated hardware forums are a much better place to seek assistance.

Please note that the F@H Software is an extremely stressful application, so an unstable system that generates errors will probably cost you more points than the overclock (factory/custom) gains. The errors also slow down the science, since duplicate WUs are sent out to determine whether the WU is bad (more details below). Hence, if your system produces errors, you must adjust it to the point that it can process the most demanding WU without errors. If you are not sure of something, please avoid doing it or ask before doing it.

FILE_IO_ERROR
  1. Run CHKDSK to ensure that the hard disk drive isn't faulty.
  2. Make sure that the folder isn't being "shared" by another Client. If you have multiple Clients, they must be running from separate folders.
  3. Some Anti-Virus programs can interfere with the folding files. Adding the folding directory to the exception list will avoid this problem.
  4. You may not have write permission for that folder so check your permission level.
  5. The partition may be full so consider freeing up some space.
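Items 4 and 5 of the FILE_IO_ERROR list can be checked with a short script. The following is only a minimal sketch: the function name `check_folding_dir` and the 1 GiB free-space threshold are illustrative assumptions, not anything defined by the F@H Software.

```python
import os
import shutil
from pathlib import Path


def check_folding_dir(path, min_free_bytes=1024**3):
    """Return a list of problems found for the folding folder at `path`.

    min_free_bytes is an illustrative threshold (1 GiB by default),
    not an official F@H requirement.
    """
    folder = Path(path)
    if not folder.is_dir():
        return [f"{folder} does not exist or is not a directory"]

    problems = []
    # Item 4: the Client needs write permission for its folder.
    if not os.access(folder, os.W_OK):
        problems.append(f"no write permission for {folder}")
    # Item 5: a full partition stops new files from being written.
    free = shutil.disk_usage(folder).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free on the partition holding {folder}")
    return problems
```

An empty list means neither of these two checks found a problem; it says nothing about disk health, folder sharing, or Anti-Virus interference (items 1 to 3).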
There is no domain decomposition...
  1. Modify the CPU value (Advanced Control/FAHControl -> Configure -> Slots Tab -> CPU Slot -> Edit -> New CPU Value) to a smaller one which is neither a prime number bigger than 3 nor a multiple of 5. Generally speaking, the safe values are: 2, 4, 6, 8, 12, 24, 32, and 64. There can be other safe numbers too, but those would be Project specific; you can create a topic or search the Forums to find out. Some values fail because the FahCore can't divide the assigned WU across all the CPUs (technical details). Create a new topic here, provide the log file, and state the original CPU value so that the F@H Team can prevent that Project from being assigned to those CPU values.
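The rule of thumb above can be expressed in a few lines. This is a sketch only: the function names are hypothetical, and it encodes just the rule and the list of safe values as stated in this post, so it cannot account for Project-specific failures.

```python
# Safe values as listed in this post; other values may work per Project.
LISTED_SAFE_VALUES = (2, 4, 6, 8, 12, 24, 32, 64)


def is_prime(n):
    """Trial-division primality test, sufficient for CPU counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def passes_rule_of_thumb(cpus):
    """True if cpus is neither a prime bigger than 3 nor a multiple of 5."""
    if cpus < 1 or cpus % 5 == 0:
        return False
    return not (cpus > 3 and is_prime(cpus))


def suggest_smaller_cpu_value(current):
    """Largest listed safe value strictly below `current`, or None."""
    candidates = [v for v in LISTED_SAFE_VALUES if v < current]
    return max(candidates) if candidates else None
```

For example, `passes_rule_of_thumb(7)` is false because 7 is a prime bigger than 3, and `suggest_smaller_cpu_value(7)` picks 6 from the listed values.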
OpenCL: Not detected: clGetDeviceIDs() returned -1 OR clGetPlatformIDs() returned -1001
  1. Ensure you have installed the Vendor drivers (AMD, Nvidia).
Bad State detected... attempting to resume from last good checkpoint...
  1. Ensure that your system is sufficiently cooled (GPU-Z, Real Temp, HWMonitor, Core Temp, HWiNFO)
  2. Remove any overclock (including factory overclock) and see whether that resolves the issue.
  3. It could be a Bad WU (more details below).
Miscellaneous Issues
  1. Ensure that you
  1. If you have overclocked, undervolted, or overvolted your CPU, please return it to the stock settings specified by the vendor (AMD, Intel).
  2. If you have overclocked, undervolted, or overvolted your RAM, please return it to the stock settings specified by the vendor.
  3. If you have overclocked, undervolted, or overvolted your GPU, please return it to the stock settings specified by the vendor (AMD, Nvidia).
  4. Clean any dust build-ups from your system.
  5. Ensure that your fans are operating as normal.
  6. Ensure you're not using Beta drivers for your system; use WHQL drivers instead.

2 - Bad WU

Posted: Sat Apr 18, 2020 3:11 am
by PantherX
  • Bad WU
It is statistically impossible to have a Project without a single bad WU. You may encounter them, although they are rare. If you get a WU that may be bad, please report it here, stating the PRCG (Project, Run, Clone, Gen) and the relevant section of the log (details). Please note that for each WU error the F@H Server receives, an additional copy of the WU is sent out to confirm that the WU itself is bad rather than the hardware being faulty; the WU is automatically stopped from being sent out after a certain number of failures. If the bad WU is stuck, do the following:
  1. Make a note of the Work Queue ID belonging to the bad WU
  2. Stop the F@H Software
  3. Navigate to the FAHClient folder (%AppData%\FAHClient, /var/lib/fahclient, /Library/Application Support/FAHClient)
  4. Navigate to the Work folder
  5. Delete the folder which matches the Work Queue ID above
  6. Start the F@H Software
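Steps 3 to 5 can also be scripted. The following is a sketch under stated assumptions: the function name `remove_stuck_wu` is hypothetical, the Work folder is assumed to be named `work`, and the sub-folder is assumed to be the Work Queue ID written as two digits (for example `01`). Verify both against your own installation, and stop the F@H Software yourself first, since the script does not do that.

```python
import shutil
from pathlib import Path


def remove_stuck_wu(client_dir, queue_id, dry_run=True):
    """Delete the work sub-folder for `queue_id` and return its path.

    Assumptions (not confirmed by this post): the Work folder is called
    "work" and the sub-folder name is the queue ID as two digits.
    With dry_run=True nothing is deleted; the target is only located.
    """
    target = Path(client_dir) / "work" / f"{int(queue_id):02d}"
    if not target.is_dir():
        raise FileNotFoundError(f"no work folder for queue ID {queue_id}: {target}")
    if not dry_run:
        # Removes the folder and everything inside it; this cannot be undone.
        shutil.rmtree(target)
    return target
```

Calling it once with the default `dry_run=True` shows which folder would be removed, so the path can be checked before anything is deleted.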
After that, you will be assigned a new WU so you can continue folding. Do monitor the WU that you have reported using this site. If it turns out to be bad (3 or more reports of it being faulty), your system is not at fault. If someone else completes it, then you need to check your system. Please remember that an occasional error is to be expected and often has no definitive cause. An error or two once in a while is generally fine, but if errors are frequent, you are advised to investigate and eliminate the cause so that you can increase your contribution to F@H.
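The way to read a reported WU's history can be summarised as a small decision function. This is a sketch only: the function and parameter names are hypothetical, and the "inconclusive" outcome is an assumption for cases this post does not cover.

```python
def interpret_wu_history(fault_reports, completed_by_another_donor):
    """Interpret the history of a WU that failed on your system.

    fault_reports: how many times the WU has been reported as faulty.
    completed_by_another_donor: True if someone else folded it successfully.
    """
    if completed_by_another_donor:
        # The WU itself was fine, so the error came from your side.
        return "check your system"
    if fault_reports >= 3:
        # Threshold taken from this post: 3 or more reports means a bad WU.
        return "bad WU"
    # Assumption: with fewer reports and no completion, nothing can be said yet.
    return "inconclusive"
```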



Additional comments are welcome via Private Message (PM) to me. I wish to thank the following users who have contributed to this thread (alphabetically):
7im, Assimilator1, bruce, toTOW, uncle fuzzy

Last Updated: 19 May 2020