Test-DONE
=========

+ sub_if_gte (avoid branches/if ...)  +small enh+ (also needed for vectors)
- __local tmps instead of res->... intermediates?  - no effect -
- inline small functions (automatically done)
- divergent branches (bad, bad, bad)
- select vs. "... ? ... : ..."
+ 3x24-bit kernel
- ((x<<8)>>8) instead of (x&0xFFFFFF)   (found in the forum) : no difference at all!
+ loop-unroll: OK, 64-loop in mod: unrolled 10% faster on GPU, no (big) change on CPU
+ vectorize (4 FC's at once?)
+ mad(ulong, ...) runs implicit I_TO_F + MULADD_e (float) ... SLOW!!!
+ hi= (hi << 1) + ((lo & 0x8000000000000000) ? 1 : 0) faster than (hi<<1)+(lo>>63ULL)

Test-TODO
=========
- 4x24-bit kernel
- in mod144_72: if q.d5!=0 ... will most likely skip the first block for one in 19 executions. Is that worth it?


TODO
====
+DONE+ - only set new params for subsequent kernel-runs
+DONE+ - cleanup mystuff
+DONE+ - cleanup cuda
+DONE+ - cleanup kernel file => test-kernel
+DONE+ - cleanup AMD licensed stuff
+DONE+ - KERNEL trace only the first thread
+DONE+ - check limits for 95-bit kernel
+DONE+ - add 64-bit kernel
+DONE+ - port 72 (3x24 bit) kernel
+DONE+ - call 72-bit kernel correctly
+DONE+ - port barrett kernels (div, mul, mod)
+DONE+ - create a kernel struct: name, cl_kernel_id, bit_min, bit_max
+DONE+ - save a copy / rev control
+DONE+ - re-enable auto-adjust of sieve-limit (measure wait-time)
+DONE+ - signal handler
+DONE+ - max 128 threads per block ... needed at all?
+DONE+ - --help
+DONE+ - complain about bad args
+DONE+ - calibrate auto-adjust
+DONE+ - cleanup unused kernels and kernel functions
+DONE+ - make it run on 11.10 / 11.11
+DONE+ - save copy of cpt file
+DONE+ - use old copy of cpt file if necessary
+DONE+ - allow running multiple instances in the same dir (all parameters as options)
+DONE+ - allow cpts only after some classes
+DONE+ - -d g
+DONE+ - "restrict" "const" wherever possible
+DONE+ - merge mfaktc 0.18
+DONE+ - allow reading ckp's of other versions (now even mfaktc)
+DONE+ - optimize 72-bit kernel for small vs. large factors (skip some converts depending on the size)
+DONE+ - result format changes
+DONE+ - versioning for txt files
+DONE+ - bugfix for "Only 1 platforms found, cannot use platform 1 ..."
+DONE+ - lock results file
+DONE+ - bugfix for total runtime calculation if a factor was found in a restarted run
+DONE+ - test optimisation options on Linux
+DONE+ - fast 64-bit kernel (mul24_63)
+DONE+ - 5x15bit based on mul24 (barrett15_75)
+DONE+ - display GHz-days/day: 0.016968 * pow(2, $bitlevel - 48) * 1680 / $exponent

  // example using M50,000,000 from 2^69-2^70:
  = 0.016968 * pow(2, 70 - 48) * 1680 / 50000000
  = 2.3912767291392 GHz-days

  // magic constant is 0.016968 for TF to 65-bit and above
  // magic constant is 0.017832 for 63-and 64-bit
  // magic constant is 0.01116 for 62-bit and below

  Then all you need is: 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment
+  %C - class ID (n/4620)            "%4d"
+  %c - class number (n/960)         "%3d"
+  %p - percent complete (%)         "%6.2f"
+  %g - GHz-days/day (GHz)           "%7.2f"
+  %t - time per class (s)           "%6G"
+  %e - eta (d/h/m/s)                "%2dm%02ds"/"%2dh%02dm"/"%2dd%02dh"
+  %n - number of candidates (M/G)   "%6.2fM"/"%6.2fG"
+  %r - rate (M/s)                   "%6.2f"
+  %s - SievePrimes                  "%7d"
+  %w - CPU wait time for GPU (us)   "%6lld"
+  %W - CPU wait % (%)               "6.2f"
+  %d - date (Mon nn)                "%b %d"
+  %T - time (HH:MM)                 "%H:%M"
+  %U - username (as configured)     "%15s"
+  %H - hostname (as configured)     "%15s"
+  %M - the exponent being worked on "%d"  !! no fixed width !!
+  %l - the lower bit-limit          "%2d"
+  %u - the upper bit-limit          "%2d"

+DONE+ - run all tests with SievePrimes 256 and 1000
+DONE+ - SievePrimesMin
+DONE+ - two optional fields in mfaktc.ini for username and computerid, so that if the user fills them out then it puts
         the Primenet-style "UID: username/computername, " at the beginning of results lines
+DONE+ - output datestamp lines in results.txt [Sat Dec 31 21:29:40 2011]
+DONE+ - decode GPU-names
+DONE+ - bugfix for high SievePrimes
+DONE+ - reordering kernels dynamically
+DONE+ - worktodo.add
+DONE+ - GHz-days/day in the "total time ..." line
+DONE+ - better estimate of compute cores (CPU!)
+DONE+ - 6x15bit based on mul24
+DONE+ - test a montgomery kernel (1x64-bit, 6x15-bit: ~10% slower than barretts)
+NO+ - transfer 4 FCs in 3 ints +not needed due to GPU sieving+
+NO+ - dynamic sieve size, adjusting as SievePrimes changes +not needed due to GPU sieving+
+NO+ - test if fix SieveSizeLimit is faster : yes, is. Provide 3 fix size sieve functions and dynamically use them. +not needed due to GPU sieving+
+DONE+ - GPU-sieve (for 15-bit and 32-bit based barrett kernels)
+DONE+ - set affinity for main thread
+NO+ - mu = 2^(bitlevel+1) / FC on host? + not needed due to GPU sieving +
+NO+ - evaluate max mem alloc (device info) + not needed +
+NO+ - retrieve L1/L2-cache-size and optimize sieve accordingly at runtime + not needed due to GPU sieving +
+NO+ - verify factors using the modulo-kernels and clEnqueueTask + not needed +
+NO+ - 4 threads - 4 classes at once - also in host code to maximize GPU-utilization in a single program + not needed due to GPU sieving +
+NO+ - Llano: zero-copy-path (BeaverCreek) + not needed due to GPU sieving +
+DONE+ - _getcwd / _chdir to avoid problems on temporary loss of the device
+NO+ - CPU-sieving: multiple queues + not needed due to GPU sieving +
+DONE+ - save and reload binary kernels
+DONE+ - migrate to newer VS version (to VS12 - +2% CPU sieving speed)
+NO+ - AllowSleep + not needed due to GPU sieving +
+DONE+ - _kbhit + getchar to allow config changes at runtime
+DONE+ perftest modes for sieving (SIEVE_SIZE), data copy and kernel speed.
+DONE+ allow for kernel reload (e.g. different vector size, or -DSMALL_EXP)
+DONE+ Split device detection from kernel compilation to make use of device's capabilities
+DONE+ correct GHzDays prediction for factors found
+DONE+ Complete instructions on building on Windows (VS and MinGW)
+DONE+ use clCreateCommandQueueWithProperties instead of deprecated clCreateCommandQueue for OpenCL 2.0 and above
1 copy small exp handling from mfaktc
1 find out why 0.13/0.14 were faster for some kernels
2 build test kernels to auto-detect DP vs. SP and mul24 vs. mul32 performance
2 detect microarchitecture based on feature set rather than device name
2 MODBASECASE errors in 6x15bit-kernels
2 Try to build an assembler kernel for GCN (find possible optimizations like carry, mul24_hi, scalar unit etc)
2 Auto-adjust VectorSize per kernel
3 add James' PM1-F-vs-TF-NF list to selftest
3 serialize memory access for vector kernels
3 make/allow for small exps/bit-levels
3 SDK 2.6: __attribute__((always_inline)), set environment variable GPU_ASYNC_MEM_COPY=2, Profiler, __kernel __attribute__((vec_type_hint(uintn)))
3 test different local workgroup sizes, also in combination with the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier
3 test speed without event sync (just in-order-queue)
3 Performance: vectorize GPU-sieve / reading from sieve
3 auto-optimizer: parameterize possible uses of mad_hi, amd_bitalign and others, let a test mode find out about the best config
4 dummy kernel for performance testing of sieve and transfer
4 kernel trace more specific (bitmask)
4 error handling: RES[0] -= 100; save up to 3 ints in RES; only in single-vec
4 clGetKernelWorkGroupInfo in load kernels: preferred size, local mem
5 div/mul-modulo for 96 bits
5 cleanup: separate GPU- and CPU-sieve based kernels, compile and load only one set of them
5 repunit

merge mfaktc changes: small exps: should correct failing test