cc65: Atari: Setting pixel data 10x slower than before

This is a strange bug, probably one for Chris Groessler! 😃

I have been using a 2018 version of cc65 for a while, and decided to upgrade to the latest version for the next release of 8bit-Unity. Most functions work normally (such as loading bitmaps), but one function for setting pixels has become incredibly slow. The code is as follows (compiled using the default atarixl.cfg):

#define BITMAPRAM1 (0x7010) // 7010-8f50 (bitmap frame 1)
#define BITMAPRAM2 (0xa010) // a010-bf50 (bitmap frame 2)

void SetPixel(unsigned int pixelX, unsigned int pixelY, unsigned char color)
{
	unsigned int offset;
	unsigned char shift, mask, col1, col2;	

	// Compute pixel location
	offset = 40*pixelY + pixelX/4;
	shift = 6 - 2*(pixelX%4);
	mask = 255 - (3 << shift);
	if ((pixelY+pixelX)%2) {
		col2 = (color%4) << shift;
		col1 = (color/4) << shift;
	} else {
		col1 = (color%4) << shift;
		col2 = (color/4) << shift;
	}

	// Set color/color2 in dual buffer
	POKE((char*)BITMAPRAM1+offset, (PEEK((char*)BITMAPRAM1+offset) & mask) | col1);
	POKE((char*)BITMAPRAM2+offset, (PEEK((char*)BITMAPRAM2+offset) & mask) | col2);
}

I wonder if the system is waiting for a VSYNC or something whenever I touch the bitmap RAM?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 40 (31 by maintainers)

Most upvoted comments

if ((pixelY+pixelX)%2) {

Use

if ((pixelY^pixelX)&1) {

It does the same thing, but without the clc. It’s only one byte smaller and two cycles faster, but every little bit helps!
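To back up the claim that both tests give the same result, here is a minimal sketch in portable C. Bit 0 of a sum never depends on a carry, so it equals bit 0 of the XOR. The function names are made up for illustration and are not part of cc65 or 8bit-Unity.

```c
#include <assert.h>

/* Parity of the sum, as written in the original SetPixel. */
static unsigned char parity_by_sum(unsigned char x, unsigned char y)
{
    return (unsigned char)((x + y) % 2);
}

/* Parity via XOR: no addition, so no carry to clear first. */
static unsigned char parity_by_xor(unsigned char x, unsigned char y)
{
    return (unsigned char)((x ^ y) & 1);
}

/* Hypothetical helper: asserts agreement for every pair of 8-bit inputs
   and returns how many pairs were checked. */
static unsigned long check_all_pairs(void)
{
    unsigned int x, y;
    unsigned long checked = 0;
    for (x = 0; x < 256; ++x) {
        for (y = 0; y < 256; ++y) {
            assert(parity_by_sum((unsigned char)x, (unsigned char)y) ==
                   parity_by_xor((unsigned char)x, (unsigned char)y));
            ++checked;
        }
    }
    return checked;
}
```

Calling check_all_pairs() from a test program walks all 65536 combinations. The sketch only covers 8-bit coordinates, which is what the benchmark in this thread uses.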

A8 XL: 129 Ticks -> 190 Ticks

With 060417b0dc1018adc79fd16d0eb97c61d729aea3 ‘mul40’ and two small tables I’m down to 45 on the Atari.

#include <stdlib.h>
#include <conio.h>
#include <peekpoke.h>
#include <time.h>
#include <cc65.h>

#define BITMAPRAM1 (0x7010) // 7010-8f50 (bitmap frame 1)
#define BITMAPRAM2 (0xa010) // a010-bf50 (bitmap frame 2)

unsigned char x2Shifts[] = { 6, 4, 2, 0 };
unsigned char x2Mask[] = { 0b00111111,
                           0b11001111,
                           0b11110011,
                           0b11111100 };

int main (void)
{
    unsigned int i, offset;
    unsigned char shift, mask, col1, col2, xNibble, color = 3;
    // Initialized so the benchmark does not read indeterminate values
    unsigned char pixelX = 0, pixelY = 0;
#ifndef __SIM6502__
    clock_t timer = clock();
#endif
    for (i=0; i<1024; i++) {
        // Compute pixel location
        offset = mul40(pixelY);
        offset += pixelX/4u;

        // Look up shift and mask for the pixel's position within the byte
        xNibble = pixelX & 3;
        shift = x2Shifts[xNibble];
        mask = x2Mask[xNibble];

        if ((pixelY+pixelX)%2) {
            col2 = (color&3) << shift;
            col1 = (color/4u) << shift;
        } else {
            col1 = (color&3) << shift;
            col2 = (color/4u) << shift;
        }

        // Set color/color2 in dual buffer
        POKE((char*)BITMAPRAM1+offset, (PEEK((char*)BITMAPRAM1+offset) & mask) | col1);
        POKE((char*)BITMAPRAM2+offset, (PEEK((char*)BITMAPRAM2+offset) & mask) | col2);
    }
#ifndef __SIM6502__
    cprintf("%lu", clock()-timer);
    while (1);
#endif
    return EXIT_SUCCESS;
}

Maybe it’s just me, but in general I would choose a table driven approach for the shifts and the 4 colours…
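As a rough illustration of what a fully table-driven version could look like, the sketch below also replaces the colour shift with a lookup. The names kPixelBits, kKeepMask and put_pixel_bits are hypothetical, and I have not measured whether cc65 generates faster code for it than for the benchmark above.

```c
#include <assert.h>

/* Hypothetical table: kPixelBits[slot][c] == c << (6 - 2*slot),
   i.e. the 2-bit colour c already moved to pixel slot 0..3 of a byte. */
static const unsigned char kPixelBits[4][4] = {
    { 0x00, 0x40, 0x80, 0xC0 },
    { 0x00, 0x10, 0x20, 0x30 },
    { 0x00, 0x04, 0x08, 0x0C },
    { 0x00, 0x01, 0x02, 0x03 }
};

/* Keep-mask per slot, same values as x2Mask in the benchmark. */
static const unsigned char kKeepMask[4] = { 0x3F, 0xCF, 0xF3, 0xFC };

/* Replace one 2-bit pixel inside `byte` without any run-time shift. */
static unsigned char put_pixel_bits(unsigned char byte,
                                    unsigned char pixelX,
                                    unsigned char colour)
{
    unsigned char slot = pixelX & 3;
    assert(colour < 4);   /* only 2-bit colour values are valid here */
    return (unsigned char)((byte & kKeepMask[slot]) | kPixelBits[slot][colour]);
}
```

For example, put_pixel_bits(0xFF, 1, 2) clears bits 5..4 and writes colour 2 there. In the dual-buffer scheme it would be called once per frame buffer, with color&3 for one and color/4u for the other.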

Here’s the main difference with #1328:

 ;
 ; if ((pixelY+pixelX)%2) {
 ;
-       ldy     #$02
+       ldy     #$01
        lda     (sp),y
-       clc
-       dey
-       adc     (sp),y
-       and     #$01
-       beq     L0015
+       jsr     pusha0
+       ldy     #$04
+       lda     (sp),y
+       jsr     tosadda0
+       jsr     pushax
+       ldx     #$00
+       lda     #$02
+       jsr     tosmoda0
+       stx     tmp1
+       ora     tmp1
+       beq     L0006

I don’t think this can be easily fixed since pixelY + pixelX has type int even though both operands are unsigned char. I could imagine some analysis that 0 <= pixelY + pixelX <= 2 * 255. Therefore, we could do an unsigned mod rather than a signed one. I’m not sure how hard that is, but it could be worth thinking about.

Changing (pixelY+pixelX)%2 to (pixelY+pixelX)%2U restores the old performance (along with #1328). The rest of the code can remain the same.
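A small portable sketch of why the two spellings are interchangeable here: the promoted sum of two unsigned char values always lies in 0..2*255, and signed and unsigned remainders only differ for negative values. The function names are made up for illustration. The comments describe standard C integer promotion, not the exact code cc65 emits.

```c
#include <assert.h>

/* As in the benchmark: unsigned char operands are promoted to int,
   so this is a signed remainder. */
static int parity_signed(unsigned char x, unsigned char y)
{
    return (x + y) % 2;
}

/* The 2U operand converts the int sum to unsigned int first,
   so this is an unsigned remainder. */
static unsigned int parity_unsigned(unsigned char x, unsigned char y)
{
    return (x + y) % 2U;
}

/* Hypothetical check over the whole range 0..2*255 of possible sums.
   Returns the number of sums that were checked. */
static int count_checked_sums(void)
{
    int sum, checked = 0;
    for (sum = 0; sum <= 2 * 255; ++sum) {
        assert(sum % 2 == (int)((unsigned int)sum % 2U));
        ++checked;
    }
    return checked;
}
```

So switching to %2U changes which remainder operation is requested without changing the result for any pixel coordinate.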