OK, I did some testing with the XC8 v1.31 compiler,
compiling for the PIC16F88 and the PIC18F26K20.
1ms loop for PIC18
;main.c: 65: _delay((unsigned long)((1)*(16000000L/4000.0)));
;incstack = 0
001AB2 0E06 movlw 6
001AB4 0100 movlb 0 ; () banked
001AB6 6FD9 movwf ??_main& (0+255),b
001AB8 0E30 movlw 48
001ABA u6247:
001ABA 2EE8 decfsz wreg,f,c
001ABC D7FE goto u6247
001ABE 2FD9 decfsz ??_main& (0+255),f,b
001AC0 D7FC goto u6247
001AC2 D000 nop2
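As a sanity check on the listing above, here is a rough host-side C model of the two-stage loop. It is only a sketch: it assumes the usual PIC timing (decfsz costs 1 cycle when it falls through and 2 cycles when it skips, and the branch back costs 2 cycles), the function name two_stage_cycles is made up for illustration, and the setup and padding instructions around the loop are not counted.

```c
#include <assert.h>

/* Count the instruction cycles spent inside a two-stage decfsz loop.
 * outer_init and inner_init are the values loaded by the two movlw
 * instructions; a value of 0 behaves like 256 because the 8-bit
 * counter wraps around before it is tested. */
unsigned long two_stage_cycles(unsigned outer_init, unsigned inner_init)
{
    unsigned char outer, inner;
    unsigned long cycles = 0;

    assert(outer_init <= 255u && inner_init <= 255u);
    outer = (unsigned char)outer_init;
    inner = (unsigned char)inner_init;

    for (;;) {
        inner--;                    /* decfsz on the inner counter */
        if (inner != 0) {
            cycles += 1 + 2;        /* no skip (1) + branch back (2) */
            continue;
        }
        cycles += 2;                /* skip taken */

        outer--;                    /* decfsz on the outer counter */
        if (outer != 0) {
            cycles += 1 + 2;        /* no skip (1) + branch back (2) */
            continue;
        }
        cycles += 2;                /* final skip, loop is done */
        return cycles;
    }
}
```

With the constants from the listing, two_stage_cycles(6, 48) comes out at 3995 cycles for the loop alone, so the few setup instructions and the trailing nop2 would bring the total to roughly the 4000 cycles that 1 ms needs at 16 MHz.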
1ms loop for PIC16
;main.c: 89: _delay((unsigned long)((1)*(16000000L/4000.0)));
;incstack = 0
; Regs used in _main: [wreg]
0678 3006 movlw 6
0679 1683 bsf 3,5 ;RP0=1, select bank1
067A 1303 bcf 3,6 ;RP1=0, select bank1
067B 00D3 movwf (??_main^(0+128)+1)
067C 3030 movlw 48
067D 00D2 movwf ??_main^(0+128)
067E u3287:
067E 0BD2 decfsz ??_main^(0+128),f
067F 2E7E goto u3287
0680 0BD3 decfsz (??_main^(0+128)+1),f
0681 2E7E goto u3287
0682 0000 nop
1000ms loop for PIC16
;main.c: 89: _delay((unsigned long)((1000)*(16000000L/4000.0)));
;incstack = 0
; Regs used in _main: [wreg]
0667 3015 movlw 21
0668 1683 bsf 3,5 ;RP0=1, select bank1
0669 1303 bcf 3,6 ;RP1=0, select bank1
066A 00D4 movwf (??_main^(0+128)+2)
066B 304B movlw 75
066C 00D3 movwf (??_main^(0+128)+1)
066D 30D1 movlw 209
066E 00D2 movwf ??_main^(0+128)
066F u3287:
066F 0BD2 decfsz ??_main^(0+128),f
0670 2E6F goto u3287
0671 0BD3 decfsz (??_main^(0+128)+1),f
0672 2E6F goto u3287
0673 0BD4 decfsz (??_main^(0+128)+2),f
0674 2E6F goto u3287
0675 0000 nop
A 1000ms delay was not possible on the PIC18, so I used 40ms, which is close to the maximum possible:
;main.c: 65: _delay((unsigned long)((40)*(16000000L/4000.0)));
;incstack = 0
001AB2 0ED0 movlw 208
001AB4 0100 movlb 0 ; () banked
001AB6 6FD9 movwf ??_main& (0+255),b
001AB8 0ECA movlw 202
001ABA u6247:
001ABA 2EE8 decfsz wreg,f,c
001ABC D7FE goto u6247
001ABE 2FD9 decfsz ??_main& (0+255),f,b
001AC0 D7FC goto u6247
Summarizing:
For larger delays on the PIC16, the compiler adds one more loop stage and one more byte of memory to hold that stage's counter.
The PIC18 appears to be limited to a two-stage loop that uses one memory byte and the W register.
One inner-loop iteration takes 3 cycles, so (255*3)*255 = ~195000 cycles. That is a bit more than the documentation states.
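The rough estimate can be refined slightly. The sketch below is plain host-side C with made-up function names, and it assumes that an inner pass of n iterations costs n-1 iterations of 3 cycles plus a final 2-cycle skip, that a counter loaded with 0 runs 256 iterations, and that each outer iteration adds 3 cycles (decfsz plus the branch back).

```c
#include <assert.h>

/* Cycles for one pass of the inner decfsz/goto loop with n iterations:
 * n-1 iterations at 3 cycles each, then a 2-cycle skip on the last one. */
unsigned long inner_pass_cycles(unsigned long n)
{
    assert(n >= 1UL && n <= 256UL);
    return 3UL * (n - 1UL) + 2UL;
}

/* One steady-state outer iteration: a full 256-iteration inner pass
 * plus the outer decfsz (1 cycle) and the branch back (2 cycles). */
unsigned long outer_pass_cycles(void)
{
    return inner_pass_cycles(256UL) + 3UL;
}

/* The rough estimate from the post: (255*3)*255. */
unsigned long rough_estimate(void)
{
    return (255UL * 3UL) * 255UL;
}

/* Upper bound for a two-stage loop: 256 outer passes of full length. */
unsigned long two_stage_upper_bound(void)
{
    return 256UL * outer_pass_cycles();
}
```

Under these assumptions the rough estimate is 195075 cycles, one full outer pass is 770 cycles, and 256 such passes give 197120 cycles. That is the same number as the limit in the compiler error quoted further down, which may be where the limit comes from, though that is only a guess.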
I made one more attempt with the plain _delay() function, which takes the delay directly as a number of cycles.
I was able to compile a delay of 197120 cycles:
;main.c: 65: _delay(197120);
;incstack = 0
001AB2 0E00 movlw 0
001AB4 0100 movlb 0 ; () banked
001AB6 6FD9 movwf ??_main& (0+255),b
001AB8 0EFF movlw 255
001ABA u6247:
001ABA 2EE8 decfsz wreg,f,c
001ABC D7FE goto u6247
001ABE 2FD9 decfsz ??_main& (0+255),f,b
001AC0 D7FC goto u6247
001AC2 F000 nop
When I increased the number, I got this compilation error:
main.c:65: error: (1274) delay exceeds maximum limit of 197120 cycles
So the compiler knows the limit. I think there is an error in the documentation.
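To relate that cycle limit back to milliseconds, here is a small host-side C sketch. It reuses the same arithmetic as the expression in the listings, ms * (Fosc / 4000.0); the function names delay_cycles and max_whole_ms are hypothetical, and the limit value is simply taken from the error message.

```c
#include <assert.h>

/* Cycle count for a delay in milliseconds, using the same expression
 * that appears in the listings: ms * (Fosc / 4000.0). */
unsigned long delay_cycles(unsigned long ms, unsigned long fosc_hz)
{
    return (unsigned long)(ms * (fosc_hz / 4000.0));
}

/* Largest whole number of milliseconds whose cycle count still fits
 * within limit_cycles at the given oscillator frequency. */
unsigned long max_whole_ms(unsigned long fosc_hz, unsigned long limit_cycles)
{
    assert(fosc_hz >= 4000UL);  /* at least one cycle per millisecond */
    return (unsigned long)(limit_cycles / (fosc_hz / 4000.0));
}
```

At 16 MHz this gives 4000 cycles for 1 ms, 160000 for 40 ms and 4000000 for 1000 ms. With a limit of 197120 cycles, max_whole_ms(16000000UL, 197120UL) returns 49, so on the PIC18 a single delay can reach about 49 ms at this clock, and the 1000 ms request is far over the limit.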
Sorry for such a long post