
\documentclass[20pt]{article}

\usepackage{amsmath}
\usepackage[font=small,labelfont=bf]{caption}
\usepackage{hyperref}
\hypersetup{
    colorlinks=true,
    linkcolor=blue,
    filecolor=magenta,
    urlcolor=cyan,
    pdftitle={Radeon R500},
    pdfpagemode=FullScreen,
    }

\usepackage{graphicx}
\graphicspath{ {./images/} }

\usepackage{minted}
\usepackage{nicefrac}

\title{Radeon R500}
\date{}

\begin{document}

\maketitle
\href{images/x1950xt.jpg}{\includegraphics{images/x1950xt.jpg}}

\tableofcontents

\section{Introduction}

The primary/minimal project goal is ``draw a triangle on a Radeon R500 via
direct memory-mapped hardware register and texture memory accesses''. This means
no \href{https://mesa3d.org/}{Mesa}, no
\href{https://github.com/torvalds/linux/tree/v6.12/drivers/gpu/drm/radeon}{radeon}
kernel module, and certainly no OpenGL or Direct3D.

I have worked directly with several other graphics units in the past
(\href{https://github.com/buhman/saturn-examples}{Saturn VDP1},
\href{https://github.com/buhman/dreamcast}{Dreamcast Holly},
\href{https://github.com/buhman/voodoo}{Voodoo 2}). In all of these projects,
my strategy is generally:

\begin{itemize}
\item read the entire \href{doc/R5xx_Acceleration_v1.5.pdf}{reference
  documentation} at least once, front-to-back
\item copy all hardware register definitions from the documentation to a
  spreadsheet or text file (sometimes typing everything by hand if I am in such
  a chill mood)
\item progressively build increasingly-complex example programs that exercise
  the hardware
\end{itemize}

The rabbit hole for R500 seems significantly deeper, considering this is the
first graphics unit I've worked with that has programmable vertex and pixel
shader engines.

\subsection{Hardware}

For testing, I currently have this hardware configuration:

\begin{itemize}
\item ASUS P4B-LX (Intel 845) motherboard
\item Intel Pentium 4 2.6GHz SL6PP (Northwood)
\item 1024 MB RAM
\item 32GB PATA SSD
\item ATI Radeon X1650 PRO 512MB AGP
\end{itemize}

I also have the X1950 XT PCIe shown in the photo, which amazingly has never been
used, and prior to the photo was sealed in an antistatic bag from manufacture to
now.

\subsection{Test setup}

While in my other (video game console) projects I typically insist on
``bare-metal'' development with no operating system or third-party library
running on the target hardware, my experience with x86 is much more limited.

While it is something I am interested in doing, I believe creating a
zero-dependency ``code upload'' mechanism for an x86 PC that does not depend on
an operating system would severely delay my progress on R500-specific work.

For my initial exploration of R500, I will instead be manipulating the hardware
primarily from Linux kernel space. This Linux kernel code does not actually
meaningfully depend on Linux APIs beyond calling \texttt{ioremap} to get usable
memory mappings for R500 PCI resources (texture/framebuffer memory and
registers).

\section{Progress: 07 Oct 2025}

From 01 Oct 2025 to 07 Oct 2025, I achieved the following:

\begin{itemize}
\item I wrote a reasonably complete AtomBIOS disassembler
\item I can disable (IBM PC) VGA mode and manipulate the native framebuffer
\item I can upload microcode to the ``command processor'', and I can write to
  scratch registers via command processor packets (this is uncoincidentally the
  same command processor test that the radeon kernel module does).
\item I stepped through Mesa functions as invoked by a simple OpenGL
  application, and created \href{mesa/glDrawArrays.txt}{a list of R500
    registers/values} that are written by Mesa during \texttt{glDrawArrays}.
\end{itemize}

I did not achieve the following:

\begin{itemize}
\item I attempted to manipulate the R500 register state and command processor
  into drawing a triangle, but I have not been successful yet
\end{itemize}

\subsection{Documentation}

In general, I note that the R500 documentation is significantly weaker than I
hoped, and does not contain enough information to draw a triangle on the R500
from the documentation alone (with no prior knowledge about previous Radeon
graphics units).

In addition to the lack of prose, in several cases I've noticed both Mesa and
Linux reference R500 registers that are
\href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/pci/undocumented_3d_registers.h}{not
  present at all} in the documentation.

\subsection{AtomBIOS}

AtomBIOS physically exists as a section inside the ROM on R500 graphics units.
AtomBIOS is notably used for setting PLL/pixel clock frequencies and display
resolutions, among several other functions.

The Radeon graphics hardware itself does not execute AtomBIOS code--instead, it
is expected that the host (e.g., x86) CPU evaluate the instructions in the
AtomBIOS command tables. Generally the outcome of evaluating AtomBIOS code is
that several ``register write'' instructions will be executed, changing the
state of the graphics unit.

My original goal in studying AtomBIOS was that I thought I would need it to set
up the R500 display controller to a reasonable state (as a prerequisite for
drawing 3D graphics). However, after actually experimenting with ``disable VGA
mode'', I currently believe that I don't actually need to implement
resolution/mode changes, and can proceed without it.

\subsection{PIO mode}

The Linux kernel exclusively communicates with R500 via ``PCI bus mastering''.
A ``ring buffer'' is allocated in ``GTT'' space, which from the graphics unit's
perspective exists in the same address space as framebuffer memory, but is an
address that is outside the framebuffer memory that physically exists.

I also observed via debugfs that the GTT apparently involves some sort of sparse
page mapping, but I don't understand how this works from an x86 perspective.

In the absence of an understanding of how to make my own ``GTT'' address space,
I attempted to operate the R500 in ``PIO'' mode. This has the advantage of being
able to simply write to registers via (simple) PCI memory-mapped accesses, but
it has the disadvantage that Linux doesn't use R500 this way, so I have no
reference implementation for how PIO mode should be used.

\subsection{Triangle drawing attempt \#1}

I translated my \href{mesa/glDrawArrays.txt}{glDrawArrays notes} to
\href{https://git.idk.st/bilbo/r500/src/commit/b6472e4c16946f44e02d82f31adaa411df009c67/pci/triangle.c}{equivalent
  register writes}.

This does not work, and I don't yet understand why. The main issue is that most
of the time when I execute that code, Linux appears to ``hang'' completely, and
my ``printk'' messages are never sent over ssh. On the rare occasion when the
``hang'' does not occur, a triangle is nevertheless not drawn on the
framebuffer.

I have a few ideas for how to proceed:

\begin{itemize}
\item Move the ``triangle.c'' register accesses to userspace via
  \texttt{/sys/bus/pci}, which might improve debuggability
\item Abandon the ``write a kernel module'' idea completely, and instead
  interact with the R500 via \href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_drv.c#L565-L577}{radeon DRM ioctls}
\end{itemize}

The latter is perhaps both the most attractive, and the most work. I currently
don't have any understanding of GEM buffers, radeon buffer objects, etc., so
I'd need to study these in more detail.

\section{Progress: 14 Oct 2025}

From 08 Oct 2025 to 14 Oct 2025, I achieved the following:

\begin{itemize}
\item I studied how Mesa interacts with the \texttt{radeon} kernel module via
  \texttt{DRM\_RADEON\_} ioctls.
\item I wrote simple R500 \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/pvs_disassemble.py}{vertex shader} and \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/us_disassemble.py}{pixel shader} disassemblers.
\item I wrote a \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/parse_packets.py}{tool} to print R500 ``PM4'' packets in human-readable form.
\item I laboriously \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/bits}{copied and reformatted} all bit definitions from \href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf}
\item I wrote \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs}{several other miscellaneous tools} related to register and bit parsing and manipulation.
\item I wrote two \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/single_color.c}{humble} \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/vertex_color.c}{demos} to draw a triangle on R500.
\end{itemize}

\subsection{Radeon DRM}

As implied in the last update, primarily due to my lack of experience with
bare-metal x86, I decided it would be a better approach to interact with the R500
Command Processor via the \texttt{radeon} kernel module, which provides a
partially reasonable interface for this via the \texttt{DRM\_RADEON\_CS} ioctl.

All \texttt{DRM\_RADEON\_} ioctls are mostly or entirely undocumented. Instead,
I built debugging symbols for Mesa and other supporting libraries so that I
could set breakpoints in GDB to observe what sequences of \texttt{DRM\_RADEON\_}
ioctls Mesa uses.

From my previous \href{mesa/glDrawArrays.txt}{glDrawArrays notes} observations,
I noticed this strange sequence:

\begin{verbatim}
0x0000138a  // type 0 packet, count=0, starting offset = RB3D_COLOROFFSET0
0x00000000  // RB3D_COLOROFFSET0 = 0
0xc0001000  // type 3 packet, count=0, opcode=NOP
0x00000000  // zero (meaningless data)
\end{verbatim}

At first, it seemed Mesa was deliberately setting the colorbuffer write address
to (VRAM address) zero, which seemed like a strange choice considering I am
debugging an X11/GLX OpenGL application--surely the colorbuffer address would be
some non-zero value several megabytes after the beginning of VRAM.

I later attempted to send my own PM4 packet via \texttt{DRM\_RADEON\_CS}. This
initial attempt returned \texttt{Invalid argument}, with the following
message in dmesg:

\begin{verbatim}
[ 1205.978993] [drm:radeon_cs_packet_next_reloc [radeon]] *ERROR* No packet3 for relocation for packet at 14.
[ 1205.979427] [drm] ib[14]=0x0000138E
[ 1205.979433] [drm] ib[15]=0x00C00640
[ 1205.979437] [drm:r300_packet0_check [radeon]] *ERROR* No reloc for ib[13]=0x4E28
[ 1205.979545] [drm] ib[12]=0x0000138A
[ 1205.979548] [drm] ib[13]=0x00000000
[ 1205.979553] [drm:radeon_cs_ioctl [radeon]] *ERROR* Invalid command stream !
\end{verbatim}

This error message comes from
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r300.c#L664-L669}{drm/radeon/r300.c}.
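
The register offset \texttt{0x4E28} in the error message and the
\texttt{0x0000138A} header word refer to the same register. The following is a
minimal Python sketch of decoding PM4 headers, assuming the field layout used by
the Linux \texttt{radeon} driver for pre-R600 parts (packet type in bits 31:30,
count in bits 29:16, register dword offset in bits 12:0 of a type-0 header,
opcode in bits 15:8 of a type-3 header). The name
\texttt{decode\_pm4\_header} is hypothetical.

```python
def decode_pm4_header(header):
    """Decode one PM4 packet header dword into its fields.

    Assumed layout (pre-R600 radeon parts):
      bits 31:30  packet type
      bits 29:16  count (number of data dwords, minus one)
      type 0:     bits 12:0 hold the register byte offset divided by 4
      type 3:     bits 15:8 hold the opcode
    """
    packet_type = (header >> 30) & 0x3
    count = (header >> 16) & 0x3FFF
    if packet_type == 0:
        # Convert the dword offset back to a byte offset.
        return {"type": 0, "count": count, "reg": (header & 0x1FFF) << 2}
    if packet_type == 3:
        return {"type": 3, "count": count, "opcode": (header >> 8) & 0xFF}
    return {"type": packet_type, "count": count}


# The two header words from the glDrawArrays capture.
for word in (0x0000138A, 0xC0001000):
    print(hex(word), decode_pm4_header(word))
```

Under these assumptions, \texttt{0x0000138A} decodes to a type-0 write starting
at byte offset \texttt{0x4E28}, and \texttt{0xC0001000} decodes to a type-3
packet with opcode \texttt{0x10}.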

The meaningless data following the type-3 NOP packet is used by the kernel to
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_cs.c#L875-L889}{index}
the \texttt{DRM\_RADEON\_CS} ``relocs'' array (an array of GEM buffer handles).

It seems perhaps the design goal was to never expose the VRAM address of GEM
buffers to userspace (indeed there seems to be no way to retrieve that via any
GEM ioctls). This restriction is slightly disappointing, as I would have
preferred to be able to send unmodified packet data to the R500.

However, at the moment this does not appear to be a significant issue, as a
relatively small number of registers are modified by the Linux kernel's packet
parser prior to creating the indirect buffer that is actually sent to the R500
hardware.
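
The register-write-plus-NOP pattern described above can be sketched as a small
packet builder. This is a hedged illustration, not Mesa's or the kernel's code:
the helper names are hypothetical, and the NOP payload encoding (a dword offset
into the relocs chunk, with four dwords per entry) is an assumption about how
the kernel indexes the relocs array.

```python
RB3D_COLOROFFSET0 = 0x4E28  # byte offset, as reported in the dmesg output
PACKET3_NOP = 0x10          # opcode of the type-3 NOP seen in the capture
RELOC_DWORDS = 4            # assumption: each relocs entry occupies 4 dwords


def packet0(reg, count_minus_one=0):
    """Type-0 header: write (count_minus_one + 1) dwords starting at reg."""
    return ((count_minus_one & 0x3FFF) << 16) | ((reg >> 2) & 0x1FFF)


def packet3(opcode, count_minus_one=0):
    """Type-3 header with the given opcode."""
    return (0x3 << 30) | ((count_minus_one & 0x3FFF) << 16) | ((opcode & 0xFF) << 8)


def emit_reloc_write(ib, reg, value, reloc_index):
    """Append a register write followed by the NOP that names its reloc."""
    ib.append(packet0(reg))
    ib.append(value)                       # patched by the kernel parser
    ib.append(packet3(PACKET3_NOP))
    ib.append(reloc_index * RELOC_DWORDS)  # assumed dword offset into relocs


ib = []
emit_reloc_write(ib, RB3D_COLOROFFSET0, 0, reloc_index=0)
print([hex(word) for word in ib])
```

With \texttt{reloc\_index=0} this reproduces the four-dword sequence observed in
the \texttt{glDrawArrays} capture.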

\subsection{Indirect buffers}

There appears to be a lot of memory-to-memory copying in the
Linux/Mesa/DRM/GEM/radeon graphics stack:

\begin{itemize}
\item Mesa writes the OpenGL state to various internal structures
\item Mesa \href{https://gitlab.freedesktop.org/mesa/mesa/-/blob/25.0/src/gallium/drivers/r300/r300_emit.c?ref_type=heads}{copies} OpenGL state to packet commands in a userspace buffer
\item Mesa
  \href{https://gitlab.freedesktop.org/mesa/mesa/-/blob/25.0/src/gallium/winsys/radeon/drm/radeon_drm_cs.c?ref_type=heads#L486-487}{passes
    the address} of the userspace buffer to the kernel via
  \texttt{DRM\_RADEON\_CS}
\item Linux
  \href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_cs.c#L340-L358}{copies
    the entire userspace buffer} to kernel space (calling kvmalloc/kvfree on
  each ioctl)
\item The \texttt{radeon\_cs\_parser} parses and modifies the buffer originally
  generated by Mesa
\item \href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_cs.c#L613}{radeon\_cs\_ib\_fill} copies the parser result to gpu address space.
\end{itemize}

Eventually,
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r100.c#L3709-L3722}{r100\_ring\_ib\_execute}
is called, which writes the indirect buffer address (now in GPU address space)
to the ring.

It would be interesting to experiment with writing a packet buffer directly in
GPU/GTT address space (from Linux userspace), with zero copies. This would
require an entirely new set of ioctls.

\subsection{Triangle drawing attempt \#2}

These images were never drawn on-screen. I extracted them from VRAM via
\texttt{/sys/kernel/debug/radeon\_vram}.

\begin{figure}
  \href{images/single_color_macrotiled.png}{\includegraphics{images/single_color_macrotiled.png}}
  \caption*{R500 framebuffer capture, \texttt{single\_color.c}}
\end{figure}

Though I was not aware of it yet, the above image was indeed my triangle, and
\texttt{COLORPITCH0} was merely in ``macrotiled'' mode. Once I realized this, I
produced this image (still in off-screen VRAM):

\begin{figure}
  \href{images/single_color.png}{\includegraphics{images/single_color.png}}
  \caption*{R500 framebuffer capture, \texttt{single\_color.c}}
\end{figure}

This \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/single_color.c}{``single color''} demo deliberately uses very simple vertex and fragment
shaders:

\begin{figure}
\begin{verbatim}
instruction[0]:
  0x00f00203  dst: VE_ADD out[0].xyzw
  0x00d10001  src0: input[0].xyzw
  0x01248001  src1: input[0].0000
  0x01248001  src2: input[0].0000
\end{verbatim}
\caption*{R500 vertex shader (1 instruction, 128-bit control word)}
\end{figure}

This vertex shader is doing the equivalent of:

\begin{figure}
  \href{verbatim/vertex_shader_equivalent_single_color.glsl}{\includegraphics{verbatim/output/vertex_shader_equivalent_single_color.glsl.pdf}}
\end{figure}

The W component \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae//drm/single_color.c#L339}{comes from}
\texttt{VAP\_PROG\_STREAM\_CNTL\_EXT\_\_SWIZZLE\_SELECT\_W\_0(5)}, which
swizzles W to a constant \texttt{1.0}, despite W not being present in the vertex
data.

\begin{figure}
\begin{verbatim}
instruction[0]:
  0x00078005  OUT RGBA
  0x08020080  RGB ADDR0=0.0 ADDR1=0.0 ADDR2=0.0
  0x08020080  ALPHA ADDR0=0.0 ADDR1=0.0 ADDR2=0.0
  0x1c9b04d8  RGB_SEL_A=src0.110 RGB_SEL_B=src0.110 TARGET=A
  0x1c810003  ALPHA_OP=OP_MAX ALPHA_SEL_A=src0.0 ALPHA_SEL_B=src0.0 TARGET=A
  0x00000005  RGB_OP=OP_MAX
\end{verbatim}
\caption*{R500 fragment shader (1 instruction, 192-bit control word)}
\end{figure}

This fragment shader is doing the equivalent of:

\begin{figure}
  \href{verbatim/fragment_shader_equivalent_single_color.glsl}{\includegraphics{verbatim/output/fragment_shader_equivalent_single_color.glsl.pdf}}
\end{figure}

via the src swizzles. I think it is interesting that there are so many options
for producing inline constants within the fragment shader.

The ``target'' fragment shader field also seems interesting. I am excited to
write shaders that use multiple output buffers.

\subsection{DRM/KMS/GBM}

These renders were not displayed on-screen, so I looked for ways to correct
this.

Perhaps the most obvious method would be to write to the display controller
registers (\texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS}) via
\texttt{DRM\_RADEON\_CS}. However, this does not work due to the command parser
anti-fun implemented in
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r300.c#L643}{r300\_packet0\_check}:
any register not present in that case statement is considered invalid, and the
packet buffer is not submitted.

I attempted to do this the ``right way'' via the DRM/KMS/GBM APIs. I then
learned that this does not behave correctly on my R500 because demos that wait
for the flag returned by \texttt{DRM\_IOCTL\_MODE\_PAGE\_FLIP} hang forever.

I noticed this earlier on Xorg/GLX as well, as I have been using the
\texttt{vblank\_mode=0} environment variable to avoid hanging forever in
\texttt{glXSwapBuffers}. This appears to be a Linux kernel bug, but I didn't
investigate this further.

\subsection{On-screen drawing}

I noticed in \texttt{/sys/kernel/debug/radeon\_vram\_mm} that the Linux console
is only using a single framebuffer (and does not double-buffer).

This is fortunate, because this means I can simply
\href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/pci_user/main.c#L48}{mmap
  the register address space} and write
\texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} myself without worrying about the
Linux console overwriting my change. I observed the \texttt{0x813000} value from
\texttt{/sys/kernel/debug/radeon\_vram\_mm}--there appears to be no other way to
get the vram address of a GEM buffer.

This is ``good enough'' for now, though at some point I'll want to learn how to
do proper vblank-synchronized double buffering.
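
As a rough illustration of the ``mmap the register address space'' approach, the
following Python sketch wraps an mmap-able file in 32-bit read/write helpers.
It is not a transcription of \texttt{pci\_user/main.c}: the class name, the
sysfs path, the BAR index, and the \texttt{0x6110} register offset in the usage
comment are all assumptions. Python's \texttt{struct} module also does not
guarantee a single 32-bit bus access, so this is only a sketch of the idea.

```python
import mmap
import struct


class RegisterWindow:
    """32-bit little-endian register access over an mmap-able file."""

    def __init__(self, fileobj, length):
        # Shared read/write mapping (the default on Unix).
        self.mem = mmap.mmap(fileobj.fileno(), length)

    def read32(self, offset):
        return struct.unpack_from("<I", self.mem, offset)[0]

    def write32(self, offset, value):
        struct.pack_into("<I", self.mem, offset, value & 0xFFFFFFFF)

    def close(self):
        self.mem.close()


# Hypothetical usage on real hardware (path, BAR index, mapping length and
# register offset are assumptions, not values taken from the demos):
#
#   with open("/sys/bus/pci/devices/0000:01:00.0/resource2", "r+b") as f:
#       regs = RegisterWindow(f, 0x10000)
#       regs.write32(0x6110, 0x813000)  # D1GRPH_PRIMARY_SURFACE_ADDRESS
#       regs.close()
```

Keeping the file object as a constructor argument means the same helper can be
exercised against an ordinary file before it is pointed at a PCI resource.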

\subsection{Triangle drawing attempt \#3}

I felt the next logical step was to learn how attributes and constants are
passed through the shader pipeline, so I then \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/vertex_color.c}{created a demo} that produced this image (this time also displayed on-screen):

\begin{figure}
  \href{images/vertex_color.png}{\includegraphics{images/vertex_color.png}}
  \caption*{R500 framebuffer capture, \texttt{vertex\_color.c}}
\end{figure}

\begin{figure}
\begin{verbatim}
instruction[0]:
  0x00702203  dst: VE_ADD out[1].xyz_
  0x01d10021  src0: input[1].xyz_
  0x01248021  src1: input[1].0000
  0x01248021  src2: input[1].0000
instruction[1]:
  0x00f00203  dst: VE_ADD out[0].xyzw
  0x01510001  src0: input[0].xyz1
  0x01248001  src1: input[0].0000
  0x01248001  src2: input[0].0000
\end{verbatim}
\caption*{R500 vertex shader (2 instructions, 128-bit control words)}
\end{figure}

This vertex shader is doing the equivalent of

\begin{figure}
  \href{verbatim/vertex_shader_equivalent_vertex_color.glsl}{\includegraphics{verbatim/output/vertex_shader_equivalent_vertex_color.glsl.pdf}}
\end{figure}

The extra vertex input is fed to the vertex shader via changes to
\texttt{VAP\_PROG\_STREAM\_CNTL\_0},
\texttt{VAP\_PROG\_STREAM\_CNTL\_EXT\_0}. Based on my currently limited
understanding, it seems that arranging the vertex data like this:

\begin{figure}
  \href{verbatim/vap_prog_stream_vertices.c}{\includegraphics{verbatim/output/vap_prog_stream_vertices.c.pdf}}
\end{figure}

is easier to deal with in \texttt{VAP\_PROG\_STREAM\_CNTL} than:

\begin{figure}
  \href{verbatim/vap_prog_stream_vertices2.c}{\includegraphics{verbatim/output/vap_prog_stream_vertices2.c.pdf}}
\end{figure}

\begin{figure}
\begin{verbatim}
instruction[0]:
  0x00078005  OUT RGBA
  0x08020000  RGB ADDR0=temp[0] ADDR1=0.0 ADDR2=0.0
  0x08020080  ALPHA ADDR0=0.0 ADDR1=0.0 ADDR2=0.0
  0x1c440220  RGB_SEL_A=src0.rgb RGB_SEL_B=src0.rgb TARGET=A
  0x1cc18003  ALPHA_OP=OP_MAX ALPHA_SEL_A=src0.1 ALPHA_SEL_B=src0.1 TARGET=A
  0x00000005  RGB_OP=OP_MAX
\end{verbatim}
\caption*{R500 fragment shader (1 instruction, 192-bit control word)}
\end{figure}

This fragment shader is doing the equivalent of:

\begin{figure}
  \href{verbatim/fragment_shader_equivalent_vertex_color.glsl}{\includegraphics{verbatim/output/fragment_shader_equivalent_vertex_color.glsl.pdf}}
\end{figure}

The \texttt{temp} input appears to be written by
\texttt{VAP\_OUT\_VTX\_FMT\_0\_\_VTX\_COLOR\_0\_PRESENT} and read due to the
changes to \texttt{RS\_COUNT} and \texttt{RS\_INST\_0}.

\section{Progress: 21 Oct 2025}

From 15 Oct 2025 to 21 Oct 2025, I achieved the following (roughly in chronological order):

\begin{itemize}
\item I learned how the vertex fetcher is \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/vertex_color_aos.c#L387-L401}{configured}
\item I learned how the ``point list'' drawing primitive can be used to \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear.c#L504}{clear the screen}
\item I invented a new syntax for R500 vertex shader assembly (ATI never specified one themselves)
\item I modified my R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/pvs_disassemble.py}{vertex shader disassembler} to emit this new vertex shader syntax
\item I wrote an R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/assembler/vs}{vertex shader assembler} that can process my vertex shader assembly syntax
\item I created several animated demos with \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_translate_vblank.c#L849-L859}{vblank-synchronized double buffering}
\item I learned how to configure and draw (multi-)textured triangles
\item I learned how to configure, clear, and use Z-buffers
\item I made a \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/texture_cube_clear_zwrite_vertex_shader.c}{textured rotating cube demo} that uses my first non-trivial \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/cube_rotate.vs.asm}{handwritten vertex shader assembly program}
\item I invented a new syntax for R500 fragment shader assembly (ATI never specified one themselves)
\item I wrote a new R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/us_disassemble2.py}{fragment shader disassembler} that emits this new fragment shader syntax
\item I wrote an R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/assembler/fs}{fragment shader assembler} that can process my fragment shader assembly syntax
\item I wrote a ``shadertoy''-style demo that uses my first non-trivial \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/shadertoy_palette.fs.asm}{handwritten fragment shader assembly program}
\end{itemize}

\subsection{DRM\_RADEON\_CS state tracking}

While attempting to refactor one of my R500 demos to send fewer registers per
\texttt{DRM\_RADEON\_CS} ioctl, I found that there is a ``state tracker'' within
\texttt{drm/radeon/r100}. Even if you don't use or depend on a
Z-buffer, \texttt{DRM\_RADEON\_CS} will still reject your packet buffer
depending on its own (imagined) concept of what the GPU state is. For example:

\begin{verbatim}
[ 1614.729278] [drm:r100_cs_track_check [radeon]] *ERROR* [drm] No buffer for z buffer !
[ 1614.729626] [drm:radeon_cs_ioctl [radeon]] *ERROR* Invalid command stream !
\end{verbatim}

This happens because \texttt{track->z\_enabled} is
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r100.c#L2435}{initially
  true} at the start of a \texttt{DRM\_RADEON\_CS} ioctl, and does not become
false unless the packet buffer
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r300.c#L836-L843}{contains
  a write} to \texttt{ZB\_CNTL}.
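
The tracker behavior described above can be modelled in a few lines. This is a
toy Python model, not the kernel's implementation; the \texttt{ZB\_CNTL} byte
offset (\texttt{0x4F00}) and the Z-enable bit mask (\texttt{0x2}) are assumptions
about what the kernel checker tests, and \texttt{z\_enabled\_after} is a
hypothetical name.

```python
ZB_CNTL = 0x4F00        # assumption: ZB_CNTL register byte offset
ZB_CNTL_Z_ENABLE = 0x2  # assumption: bit inspected by the kernel checker


def z_enabled_after(register_writes):
    """Return the tracker's idea of 'Z enabled' after a packet buffer.

    register_writes is a list of (register_offset, value) pairs in
    submission order.
    """
    z_enabled = True  # the tracker starts every ioctl assuming Z is on
    for reg, value in register_writes:
        if reg == ZB_CNTL:
            z_enabled = bool(value & ZB_CNTL_Z_ENABLE)
    return z_enabled


# A buffer that never touches ZB_CNTL leaves the tracker expecting a Z-buffer.
print(z_enabled_after([(0x4E28, 0)]))
# An explicit write of 0 to ZB_CNTL clears the tracked flag.
print(z_enabled_after([(ZB_CNTL, 0)]))
```

In this model, the practical workaround is to include a \texttt{ZB\_CNTL} write
in every submitted packet buffer, even when the Z-buffer is unused.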

This seems a bit heavy-handed. Even if the model were ``multiple applications
may be using the GPU, so a single application can't depend on previously-set
register state'', it would still be better if the kernel didn't try to enforce
this by restricting permissible content of a packet buffer.

\subsection{Vertex transform bypass}

Mesa uses a ``point'' 3D primitive to implement \texttt{glClear} on R500. It
does this by first uploading this vertex shader:

\begin{figure}
  \href{verbatim/mesa_glclear.vs.asm}{\includegraphics{verbatim/output/mesa_glclear.vs.asm.pdf}}
  \caption*{\texttt{mesa\_glclear.vs.asm}}
\end{figure}

This shader does nothing to the input other than copy it to the output, where
\texttt{out[0]} is the position vector, and \texttt{out[1]} is sent to the
fragment shader as a ``texture coordinate''. That fragment shader, in turn, does
not use the texture coordinate:

\begin{figure}
  \href{verbatim/mesa_glclear.fs.asm}{\includegraphics{verbatim/output/mesa_glclear.fs.asm.pdf}}
  \caption*{\texttt{mesa\_glclear.fs.asm}}
\end{figure}

In my ``clear''
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_rotate_vblank.c#L539}{implementation},
I instead set \texttt{PVS\_BYPASS}, which ``bypasses'' the vertex shader
completely, sending the vertices directly to the rasterizer. This is convenient
because it obviates the need to upload/change vertex shaders just to clear the
color and Z-buffers.

\subsection{Animation attempt \#1}

With a working colorbuffer clear, I wrote the
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_translate.c#L786}{single\_color\_clear\_translate.c}
demo to translate my triangle position coordinates in a loop that waits for
\texttt{DRM\_RADEON\_GEM\_WAIT\_IDLE} between each frame. This attempt
produced the following images:

\begin{figure}
  \includegraphics{videos/single_color_clear_translate.png}
  \caption*{R500 DVI capture, \texttt{single\_color\_clear\_translate.c}}
\end{figure}

This was intended to be a smooth animation, yet it is not. It also seems several
frames are never being displayed--the translation step is much smaller than what
is shown in the video.

This, interestingly, is identical to how OpenGL/GLX applications behave
on R500 with \texttt{vblank\_mode=0}.

\subsection{Animation attempt \#2}

I read the R500 display controller \href{doc/RRG-216M56-03oOEM.pdf}{register reference guide} again.
It appears to suggest the \texttt{D1CRTC\_UPDATE\_INSTANTLY} bit, when unset,
might cause changes to \texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} to be delayed in
hardware until the next vertical blanking interval begins.

This can be combined with polling \texttt{D1GRPH\_SURFACE\_UPDATE\_PENDING} to
later determine when the vblank-synchronized frame change actually occurred.

This is precisely what I implemented in
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_translate_vblank.c#L854-L855}{single\_color\_clear\_translate\_vblank.c}:

\begin{figure}
  \includegraphics{videos/single_color_clear_translate_vblank.png}
  \caption*{R500 DVI capture, \texttt{single\_color\_clear\_translate\_vblank.c}}
\end{figure}

This is much closer to what I intended. The
\texttt{D1GRPH\_SURFACE\_UPDATE\_PENDING} part is certainly working as I
expected. Setting/unsetting \texttt{D1CRTC\_UPDATE\_INSTANTLY} appears to have
no effect on \texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} behavior, so I feel my
understanding of R500 double-buffering is still incomplete.
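
The flip-then-poll sequence described above can be sketched as follows. This is
a hedged Python illustration rather than the code in
\texttt{single\_color\_clear\_translate\_vblank.c}: the register offsets, the
position of the pending bit, and the \texttt{read32}/\texttt{write32} callables
are assumptions supplied by the caller.

```python
D1GRPH_PRIMARY_SURFACE_ADDRESS = 0x6110  # assumption: register byte offset
D1GRPH_UPDATE = 0x6144                   # assumption: register byte offset
D1GRPH_SURFACE_UPDATE_PENDING = 1 << 2   # assumption: bit within D1GRPH_UPDATE


def flip_and_wait(read32, write32, surface_address, max_polls=1_000_000):
    """Request a new scanout address, then poll until it has been latched.

    read32(offset) and write32(offset, value) are caller-provided register
    accessors. Returns the number of polls that still saw the pending bit.
    """
    write32(D1GRPH_PRIMARY_SURFACE_ADDRESS, surface_address)
    for polls in range(max_polls):
        if not (read32(D1GRPH_UPDATE) & D1GRPH_SURFACE_UPDATE_PENDING):
            return polls
    raise TimeoutError("surface update still pending")
```

Bounding the loop with \texttt{max\_polls} avoids spinning forever if the
pending bit never clears, which seems prudent given that the display controller
behavior is not yet fully understood.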

\subsection{Multiple-texture sampling}

I am amazed and delighted how simple multiple-texture sampling is on R500.

As a counter-example, while Sega Dreamcast does have a fairly capable
fixed-function blending unit, to use the blending unit with multiple-texture
sampled polygons one needs to render the polygon multiple times (at least once
per texture) to an accumulation buffer. Blending is then performed between the
currently-sampled texture and the previously-accumulated result, and the blend
result is written to the accumulation buffer. From a vertex transformation
perspective, it can be inconvenient/inefficient to be required to buffer entire
triangle strips so that they can be submitted more than once per frame without
duplicating the clip/transform computations.

This is the fragment shader for
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/texture_dual.c}{texture\_dual.c}
(disassembly of code originally generated by Mesa):

\begin{figure}
  \href{verbatim/texture_dual.fs.asm}{\includegraphics{verbatim/output/texture_dual.fs.asm.pdf}}
  \caption*{\texttt{texture\_dual.fs.asm}}
\end{figure}

This pre-subtract multiply-add is an algebraic rearrangement of this GLSL code:

\begin{figure}
  \href{verbatim/texture_dual.fs.glsl}{\includegraphics{verbatim/output/texture_dual.fs.glsl.pdf}}
  \caption*{\texttt{texture\_dual.fs.glsl}}
\end{figure}

Which produces this image:

\begin{figure}
  \href{images/texture_dual.png}{\includegraphics{images/texture_dual.png}}
  \caption*{R500 framebuffer capture, \texttt{texture\_dual.c}}
\end{figure}

Being able to manipulate the texture samples as fragment shader unit temporaries
rather than as a sequence of accumulation buffer operations has me feeling excited
to do more with this.

\subsection{Z-buffer clear}

I've never worked with traditional Z-buffers before--Sega Saturn uses
\href{https://en.wikipedia.org/wiki/Painter\%27s_algorithm}{painter's algorithm}
exclusively, and Sega Dreamcast uses a ``depth accumulation buffer''
that isn't directly readable/writable.

It is slightly obvious in retrospect, but it took me several minutes to realize
that a ``depth clear'' can be implemented by covering the entire screen with a
``point'' primitive with the desired initial depth while \texttt{ZFUNC} is set to
\texttt{ALWAYS}.
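
A toy software model makes the reasoning concrete. This Python sketch is not
R500 code: the Z-buffer is a plain list, the comparison names are illustrative
strings rather than register encodings, and both function names are
hypothetical.

```python
def depth_test(zfunc, incoming, stored):
    """Return True if a fragment with depth `incoming` passes the test."""
    if zfunc == "ALWAYS":
        return True
    if zfunc == "LESS":
        return incoming < stored
    raise ValueError("unsupported zfunc: " + zfunc)


def draw_fullscreen(zbuffer, depth, zfunc):
    """Draw one primitive covering every pixel, with Z writes enabled."""
    for i, stored in enumerate(zbuffer):
        if depth_test(zfunc, depth, stored):
            zbuffer[i] = depth


zbuffer = [0.2, 0.9, 0.5]               # leftover depths from the last frame
draw_fullscreen(zbuffer, 1.0, "LESS")   # every pixel fails: nothing is reset
draw_fullscreen(zbuffer, 1.0, "ALWAYS") # every pixel passes: buffer is cleared
print(zbuffer)
```

With an ordinary comparison function the clear primitive would itself be
depth-rejected wherever the previous frame left nearer values, which is why the
test has to be forced to pass.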

\subsection{Drawing a 3D cube}

With working double-buffering, Z-buffering, and the ability to clear each of
these every frame, I felt I was finally ready to draw something ``3D''.

I thought it would be fun to first start with a cube that is transformed in
``software'' on the x86 CPU (not using a vertex shader). This sequence of videos
shows my progression on implementing this:

\begin{figure}
  \includegraphics{videos/texture_cube.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube.c}}
\end{figure}

\begin{figure}
  \includegraphics{videos/texture_cube_clear.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_clear.c}}
\end{figure}

\begin{figure}
  \includegraphics{videos/texture_cube_clear_zwrite.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite.c}}
\end{figure}

\subsection{Drawing a 3D cube with vertex shaders}

I then decided it would be fun to hand-write a ``3D rotation'' vertex shader
from scratch. I first implemented the rotation in GLSL:

\begin{figure}
  \href{verbatim/cube_rotate.vs.glsl}{\includegraphics{verbatim/output/cube_rotate.vs.glsl.pdf}}
  \caption*{\texttt{cube\_rotate.vs.glsl}}
\end{figure}

\subsubsection{Remapping shader unit sin/cos operands}

Because this shader program depends on being able to calculate sin and cos, this
meant I immediately needed to understand how to use the \texttt{ME\_SIN} and
\texttt{ME\_COS} operations.

The R500 vertex shader ME unit clamps sin/cos operands to the range
$(-\pi,+\pi)$, as in:

\begin{figure}
  \href{diagrams/sin_clamp.pdf}{\includegraphics{diagrams/sin_clamp.pdf}}
\end{figure}

``Remapping'' floating point values from $(-\infty,+\infty)$ to $(-\pi,+\pi)$ is not
obvious. I was not previously aware of this transformation:

\begin{figure}
  \href{diagrams/sin_frac.pdf}{\includegraphics{diagrams/sin_frac.pdf}}
\end{figure}

Or, expressed as R500 vertex shader assembly:

\begin{figure}
  \href{verbatim/sin_operand_remap.vs.asm}{\includegraphics{verbatim/output/sin_operand_remap.vs.asm.pdf}}
\end{figure}
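
For checking shader output against a CPU reference, the same kind of range
reduction can be written in Python. This sketch assumes the transformation is
the usual fractional-part form, $r = 2\pi \cdot \mathrm{frac}(x / 2\pi + 0.5) - \pi$;
it is not a transcription of the diagram or of the assembly above, and
\texttt{remap\_to\_pi\_range} is a hypothetical name.

```python
import math


def remap_to_pi_range(x):
    """Map any finite x to an equivalent angle in [-pi, +pi)."""
    turns = x / (2.0 * math.pi) + 0.5   # angle in turns, shifted by half a turn
    frac = turns - math.floor(turns)    # fractional part, in [0, 1)
    return frac * (2.0 * math.pi) - math.pi


for x in (-100.0, -4.0, 0.0, 4.0, 100.0):
    r = remap_to_pi_range(x)
    # sin and cos are unchanged by the remap, up to floating-point rounding.
    print(x, r, math.sin(x) - math.sin(r), math.cos(x) - math.cos(r))
```

The half-turn offset is what centers the result on zero: without it the
fractional part would map to $[0, 2\pi)$ instead of a range symmetric around
zero.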

\subsubsection{Translation of the GLSL vertex shader to R500 vertex shader assembly}

Having verified that the GLSL version works as expected in OpenGL, and knowing
how to use the R500 vertex shader sin/cos operations, I then translated the GLSL
to R500 vertex shader assembly:

\begin{figure}
  \href{verbatim/cube_rotate.vs.asm}{\includegraphics{verbatim/output/cube_rotate.vs.asm.pdf}}
  \caption*{\texttt{cube\_rotate.vs.asm}}
\end{figure}

\subsubsection{Vertex shader assembler/code generator debugging}

However, when I first executed the vertex shader cube rotation demo, I found
it did not work as expected:

\begin{figure}
  \includegraphics{videos/texture_cube_clear_zwrite_vertex_shader_incorrect.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader.c}\\(incorrect vertex shader assembler output)}
\end{figure}

After hours of debugging, I eventually found the issue was in this instruction:

\begin{figure}
  \href{verbatim/cube_rotate_3_temp.vs.asm}{\includegraphics{verbatim/output/cube_rotate_3_temp.vs.asm.pdf}}
\end{figure}

\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf} briefly mentions this on pages 98 and 99:

\begin{quote}
The PVS\_DST\_MACRO\_INST bit was meant to be used for MACROS such as a
vector-matrix multiply, but currently is only set for the following cases:

A VE\_MULTIPLY\_ADD or VE\_MULTIPLYX2\_ADD instruction with all 3 source
operands using unique PVS\_REG\_TEMPORARY vector addresses.  Since R300 only has
two read ports on the temporary memory, this special case of these instructions
is broken up (by the HW) into 2 operations.
\end{quote}

I read this paragraph much earlier, but I didn't fully understand it until
now. Indeed, this multiply-add has three unique \texttt{temp} addresses, and
must be encoded as a ``macro'' instruction.

I fixed this in my vertex shader assembler by
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/assembler/vs/validator.py}{counting the number of unique temp addresses}
referenced by each instruction, promoting \texttt{VE\_MULTIPLY\_ADD} to
\texttt{PVS\_MACRO\_OP\_2CLK\_MADD} if more than two unique \texttt{temp}
addresses are referenced.
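
The promotion rule can be sketched in Python as follows. This is an
illustration only, not the code in \texttt{validator.py}: the \texttt{Source}
tuple and the function names are hypothetical, and opcodes are modeled as plain
strings rather than encoded values.

\begin{minted}{python}
from collections import namedtuple

# Hypothetical operand model: register file name plus register address.
Source = namedtuple("Source", ["file", "address"])

def unique_temp_addresses(sources):
    """Return the set of distinct temporary-register addresses that are read."""
    return {s.address for s in sources if s.file == "temp"}

def select_opcode(opcode, sources):
    """Promote a multiply-add that reads three distinct temps to the macro form."""
    if opcode == "VE_MULTIPLY_ADD" and len(unique_temp_addresses(sources)) > 2:
        return "PVS_MACRO_OP_2CLK_MADD"
    return opcode

three_temps = [Source("temp", 0), Source("temp", 1), Source("temp", 2)]
two_temps = [Source("temp", 0), Source("temp", 0), Source("temp", 2)]
print(select_opcode("VE_MULTIPLY_ADD", three_temps))  # promoted
print(select_opcode("VE_MULTIPLY_ADD", two_temps))    # unchanged
\end{minted}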

With this change, reassembling the same vertex shader source code now produces a
correct vertex shader cube rotation:

\begin{figure}
  \includegraphics{videos/texture_cube_clear_zwrite_vertex_shader.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader.c}\\(correct vertex shader assembler output)}
\end{figure}

\subsection{Comparison with Mesa's R500 vertex shader compiler}

My ``cube rotation'' vertex shader,
\href{https://git.idk.st/bilbo/r500/src/commit/50244c7c95/drm/cube_rotate.vs.asm}{cube\_rotate.vs.asm}
is 15 instructions.

Mesa's R500 vertex shader compiler generated a
\href{https://git.idk.st/bilbo/r500/src/commit/50244c7c95/shader_examples/mesa/texture_cube_depth_vertex_shader.vs.txt}{27-instruction vertex shader}
from \href{https://r500.idk.st/verbatim/cube_rotate.vs.glsl}{semantically equivalent GLSL code}. Disassembly:

\begin{figure}
  \href{verbatim/mesa_cube_rotate.vs.asm}{\includegraphics{verbatim/output/mesa_cube_rotate.vs.asm.pdf}}
  \caption*{\texttt{mesa\_cube\_rotate.vs.asm}}
\end{figure}

I was not particularly trying to write concise code, but I find this difference
in instruction count to be surprising. In general it seems Mesa's R500 vertex
shader compiler failed to vectorize several operations, and does significantly
more scalar multiplies and scalar multiply-adds than my implementation.

Ignoring algorithmic improvements (such as lifting the sin/cos calculation to
x86 code and instead sending a 4x4 matrix to the vertex shader), there is still
more opportunity for optimization beyond my 15-instruction implementation.

Particularly, the vertex shader unit has a ``dual math'' instruction mode, where
``vector engine'' (VE\_) and ``math engine'' (ME\_) operations can be executed
simultaneously in the same instruction. \texttt{cube\_rotate.vs.asm} would
indeed benefit from such an optimization--most of the \texttt{ME\_SIN} and
\texttt{ME\_COS} instructions could be interleaved with the \texttt{VE\_MUL} and
\texttt{VE\_MAD} operations that follow (at significant expense to
human-readability).

I am curious to see more examples of the difference between Mesa's R500 vertex
shader compiler output and my own vertex shader assembly.

\subsection{Fragment shader instruction expressiveness}

Compared to the R500 vertex shader instructions, the R500 fragment shader
instructions are significantly more featureful. This makes it harder to invent
a syntax that can fully express the range of operations that an R500 fragment
shader instruction can perform.

A significant difference is where R500 vertex shaders have a single tier of
operand argument decoding, as in:

\begin{figure}
  \includegraphics{diagrams/vertex_inputs.svg}
  \caption*{R500 vertex shader instruction operand inputs (simplified)}
\end{figure}

While R500 fragment shaders have multiple tiers of operand argument decoding, as
in:

\begin{figure}
  \includegraphics{diagrams/fragment_inputs.svg}
  \caption*{R500 fragment shader instruction operand inputs (simplified)}
\end{figure}

I've written several \href{https://github.com/buhman/scu-dsp-asm}{nice assemblers}
for other architectures in the past, but I've never seen any instruction set
as expressive as R500 fragment shaders.

I attempted to directly represent this ``multiple tiers of operand argument
decoding'' in my fragment shader ALU instructions syntax.

These instructions are also vector instructions: a total of 24 floating point
input operands and 8 floating point results could be evaluated per instruction.

With this abundance of expressiveness and a relatively high skill ceiling, I'm
amazed R500 fragment shader assembly isn't more popular in programming
competitions, general everyday conversation, etc...

\subsection{Fragment shader assembler bugs}

There were two ``I spent a lot of time debugging this'' issues I encountered
with my fragment shader assembler.

The first was in this code I wrote to draw a fragment shaded circle, as in:

\begin{figure}
  \href{images/shadertoy_circle.png}{\includegraphics{images/shadertoy_circle.png}}
  \caption*{R500 framebuffer capture, \texttt{shadertoy\_circle.fs.asm}}
\end{figure}

However, in an earlier version of my fragment shader assembler, I produced this
image instead:

\begin{figure}
  \href{images/shadertoy_circle_incorrect.png}{\includegraphics{images/shadertoy_circle_incorrect.png}}
  \caption*{R500 framebuffer capture, \texttt{shadertoy\_circle.fs.asm}\\(incorrect assembler output)}
\end{figure}

In this handwritten fragment shader code:

\begin{figure}
  \href{verbatim/shadertoy_circle.fs.asm}{\includegraphics{verbatim/output/shadertoy_circle.fs.asm.pdf}}
  \caption*{\texttt{shadertoy\_circle.fs.asm}}
\end{figure}

\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf} says briefly on page 241:

\begin{quote}
Specifies whether to insert a NOP instruction after this.  This would get
specified in order to meet dependency requirements for the pre-subtract inputs,
and dependency requirements for src0 of an MDH/MDV instruction.
\end{quote}

The issue is the pre-subtract input for the \texttt{MAD |srcp.a| src0.1 -src2.a}
instruction depends on the write to \texttt{temp[0].a} from the immediately
preceding \texttt{RCP src0.a} instruction--a pipeline hazard.

To fix this, I added support for
\href{https://git.idk.st/bilbo/r500/commit/fe0684ca5e58ed3be026410812c042e883bdce71}{generating the \texttt{NOP} bit}
in my fragment shader assembler.
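
The hazard check can be sketched in Python as follows. This is a simplified
model under stated assumptions, not the assembler's actual implementation: the
\texttt{Instruction} class is hypothetical, registers are compared as whole
(file, address) pairs without regard to components or write masks, and the
MDH/MDV case mentioned in the manual is not handled.

\begin{minted}{python}
from dataclasses import dataclass, field

@dataclass
class Instruction:
    # Hypothetical model: registers are (file, address) tuples.
    name: str
    dest: tuple
    presub_sources: list = field(default_factory=list)
    nop: bool = False

def insert_nop_bits(program):
    """Set the NOP bit on any instruction whose result feeds the
    pre-subtract inputs of the instruction that immediately follows."""
    for previous, current in zip(program, program[1:]):
        if previous.dest in current.presub_sources:
            previous.nop = True
    return program

# Illustrative two-instruction program: the MAD pre-subtract reads the
# register that the RCP has just written.
program = [
    Instruction("RCP", dest=("temp", 0)),
    Instruction("MAD", dest=("temp", 1), presub_sources=[("temp", 0)]),
]
insert_nop_bits(program)
print([(i.name, i.nop) for i in program])
\end{minted}

The bit is set on the earlier instruction, following the manual's wording
(``insert a NOP instruction after this'').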

\subsection{More fragment shader assembler bugs}

While trying to produce this image:

\begin{figure}
  \href{images/shadertoy_palette.png}{\includegraphics{images/shadertoy_palette.png}}
  \caption*{R500 framebuffer capture, \texttt{shadertoy\_palette.fs.asm}}
\end{figure}

My fragment shader code instead produced this image:

\begin{figure}
  \href{images/shadertoy_palette_incorrect.png}{\includegraphics{images/shadertoy_palette_incorrect.png}}
  \caption*{R500 framebuffer capture, \texttt{shadertoy\_palette.fs.asm}\\(incorrect assembler output)}
\end{figure}

The issue was simply that in the chaos of all of the other features I was
implementing for my fragment shader assembler, I
\href{https://git.idk.st/bilbo/r500/commit/f6a0fc4fab5dee3085dcf4b9a984244bba05d5ca}{forgot to emit the \texttt{ADDRD} bits}.

This meant that while fragment shader code that exclusively uses zero-address
destinations, such as \texttt{shadertoy\_circle.fs.asm}, appeared to work
completely correctly, I encountered this bug as soon as I started using non-zero
addresses such as \texttt{temp[1]} in my fragment shader code.

\subsection{Comparison to Direct3D ``asm''}

Prior to Direct3D 10, Microsoft defined a specification for both
\href{https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx9-graphics-reference-asm-vs-3-0}{vertex shader assembly} and
\href{https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx9-graphics-reference-asm-ps-3-0}{fragment shader assembly}.

The Direct3D ``asm'' name is slightly deceptive, however, as the
\texttt{vs\_3\_0} and \texttt{ps\_3\_0} instruction syntax does not map 1-to-1
with any hardware that exists.

It would perhaps be more accurate to think of Direct3D's ``asm''
language and compiler as more analogous to a
\href{https://en.wikipedia.org/wiki/BASIC}{shader BASIC} than as a true assembly
language on the same level as ``6502 assembly'', ``Z80 assembly'' and similar.

In contrast, my R500 assembly syntaxes are deliberately/explicitly mapped 1-to-1
with R500 instructions.

\subsection{Fragment shader animated demo}

\begin{figure}
  \includegraphics{videos/shadertoy_palette.png}
  \caption*{R500 DVI capture, \texttt{shadertoy\_palette.fs.asm}}
\end{figure}

The R500 fragment shader code that I handwrote for this is:

\begin{figure}
  \href{verbatim/shadertoy_palette.fs.asm}{\includegraphics{verbatim/output/shadertoy_palette.fs.asm.pdf}}
  \caption*{\texttt{shadertoy\_palette.fs.asm}}
\end{figure}

The \texttt{float} constants are interesting--they are decoded almost
identically to the
\href{https://en.wikipedia.org/wiki/Minifloat#8-bit_(1.4.3)}{8-bit (1.4.3) (bias 7) format shown on Wikipedia},
except:
\begin{itemize}
\item There is no sign bit (the value is always positive--positive values
  can be swizzled to produce negative operands)
\item There is no ``zero'' value (zero can also be instead obtained via
  swizzles); the ``all zeros'' bit pattern instead has a value of
  \texttt{0.0009765625}.
\item There are no infinite or not-a-number values: a ``15'' exponent is treated
  as 15.
\end{itemize}

The exponent/mantissa table that shows example 7-bit float values on page 106 of
\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf} is incorrect.

\section{Progress: 26 Oct 2025}

From 21 Oct 2025 to 26 Oct 2025, I achieved the following (roughly in chronological order):

\begin{itemize}
\item I \href{https://git.idk.st/bilbo/r500/commit/8594bc4a38f6fcab2ac6e437b46bcf1e0e6d32dd}{rewrote} most of the vertex shader assembler parser/validator, and implemented support for \href{https://git.idk.st/bilbo/r500/commit/f3f1969f4a9b336536f5fb23d246f7103c41e20d}{assembling/disassembling ``dual math'' operations}
\item I implemented support for \href{https://git.idk.st/bilbo/r500/commit/96d7286e7cd3270b9dca0924d3a046d585d6dc9d}{assembling} and \href{https://git.idk.st/bilbo/r500/commit/27227426eaac265bc3126edd7d017c791640e789}{disassembling} TEX fragment shader instructions
\item I presented this project (including live demos on real hardware) at
  a \href{https://itch.io/jam/spoopy-jam-7-heckraiser}{local in-person game jam event}
\end{itemize}

\subsection{Vertex shader optimization part 1: ``MOV'' elimination}

After talking about it in person, I decided to try to golf my original
15-instruction
\href{https://git.idk.st/bilbo/r500/src/commit/c8ae311e60/drm/cube_rotate.vs.asm}{cube\_rotate.vs.asm} vertex shader.

The first opportunity for optimization is in the first two instructions of:

\begin{figure}
  \href{verbatim/cube_rotate_const_move.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move.vs.asm.pdf}}
\end{figure}

The \texttt{VE\_ADD} (being used here as a ``MOV'' instruction) is needed
because there is only a single 128-bit read port into \texttt{const} memory, so
a multiply-add like this is illegal:

\begin{figure}
  \href{verbatim/cube_rotate_const_move_illegal.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move_illegal.vs.asm.pdf}}
\end{figure}
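
An assembler could reject such instructions with a check like the following
Python sketch. The function names and the (file, address) operand tuples are
hypothetical, and the rule is an assumption drawn from the single read port
described above: two operands may read \texttt{const} memory in the same
instruction only if they read the same address.

\begin{minted}{python}
def const_addresses(sources):
    """Distinct constant-memory addresses read by one instruction."""
    return {address for (file, address) in sources if file == "const"}

def check_const_read_port(sources):
    """Raise if an instruction would need more than one const read."""
    addresses = const_addresses(sources)
    if len(addresses) > 1:
        raise ValueError(
            f"multiple const addresses in one instruction: {sorted(addresses)}")

# Legal: both const operands come from the same 128-bit register.
check_const_read_port([("temp", 0), ("const", 0), ("const", 0)])

# Illegal: two different const registers in one instruction.
try:
    check_const_read_port([("temp", 0), ("const", 0), ("const", 1)])
except ValueError as error:
    print(error)
\end{minted}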

I observed that because I never need to reference the last two constants in the
same instruction that references the first two constants, if I rearrange the
ordering of the constants to:

\begin{figure}
  \href{verbatim/cube_rotate_const_move_rearrange.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move_rearrange.vs.asm.pdf}}
\end{figure}

I can then rewrite the multiply-add instructions as:

\begin{figure}
  \href{verbatim/cube_rotate_const_move_rearrange_mad.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move_rearrange_mad.vs.asm.pdf}}
\end{figure}

\subsection{Vertex shader optimization part 2: ``dual math'' instructions}

I spent an entire day rewriting large portions of the vertex shader assembler to
add support for ``dual math'' instructions.

The original
\href{https://git.idk.st/bilbo/r500/src/commit/c8ae311e60/drm/cube_rotate.vs.asm}{cube\_rotate.vs.asm}
contains this sequence of \texttt{ME\_SIN}/\texttt{ME\_COS} instructions:

\begin{figure}
  \href{verbatim/cube_rotate_sin_cos.vs.asm}{\includegraphics{verbatim/output/cube_rotate_sin_cos.vs.asm.pdf}}
\end{figure}

The \texttt{temp[3].x} and \texttt{temp[3].y} results are needed immediately,
but \texttt{temp[3].z} and \texttt{temp[3].w} are not needed until after the
first pair of \texttt{VE\_MUL}/\texttt{VE\_MAD} operations.

The dual math instruction mode replaces the 3rd \texttt{VE\_} instruction operand
with any \texttt{ME\_} operation, so it is only usable with 2-operand
\texttt{VE\_} instructions like \texttt{VE\_MUL}.

The dual math encoding also has several restrictions (it only has
\nicefrac{1}{4} of the control word bits of a normal \texttt{ME\_}
instruction). A notable restriction is that it must write to
\texttt{alt\_temp}.

Unlike the fancy things that can be done with fragment shader
operands/sources/swizzles, a single vertex shader operand can only read from a
single 128-bit register. To continue to access \texttt{temp[3].zw} as a vector,
both \texttt{z} and \texttt{w} must therefore be stored in \texttt{alt\_temp},
even if only one of them was written by a ``dual math'' instruction.

The change (and my newly-implemented dual math syntax) is:

\begin{figure}
  \href{verbatim/cube_rotate_dual_math.vs.asm}{\includegraphics{verbatim/output/cube_rotate_dual_math.vs.asm.pdf}}
\end{figure}

Where the dual math instruction:

\begin{figure}
  \href{verbatim/cube_rotate_dual_math_single_instruction.vs.asm}{\includegraphics{verbatim/output/cube_rotate_dual_math_single_instruction.vs.asm.pdf}}
\end{figure}

Is encoded by the assembler as a single instruction and is executed by the vertex
shader unit in a single clock cycle.

The final
\href{https://git.idk.st/bilbo/r500/src/commit/c8ae311e60/drm/cube_rotate_optimize.vs.asm}{cube\_rotate\_optimize.vs.asm}
was reduced from 15 instructions to 13 instructions (compared
to Mesa's R500 vertex shader compiler's 27 instructions).

\section{Progress: 29 Oct 2025}

From 27 Oct 2025 to 29 Oct 2025, I achieved the following (roughly in chronological order):

\begin{itemize}
\item I implemented support for \href{https://git.idk.st/bilbo/r500/commit/9aecbbfc6f297ea71c72f4c4fba1b8107be95ca1}{``multiple render targets''} in the fragment shader assembler
\item I wrote a \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/texture_blur_horizontal.fs.asm}{Gaussian blur fragment shader}
\item I made a demo that draws \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/pumpkin_man.c#L963}{multiple 3D ``objects''} where each object's UV coordinates sample a \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/pumpkin_man.c#L1029-L1069}{different} \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/pumpkin_man.c#L314}{texture}
\item I did several experiments related to R500's Z-buffer implementation
\end{itemize}

\subsection{Z-buffer experiments}
\label{sec:z-buffer-experiments}
Though I produced a ``properly'' Z-buffered 3D cube demo previously, I felt I
did not fully understand the relationship between Z coordinates, W coordinates,
viewport transformations, and the actual values that are written to the
Z-buffer. At some point, I'd like to write fragment shaders that sample the
Z-buffer, so I feel I need to understand this more rigorously.

For comparison, Sega Dreamcast stores 32-bit floating-point values in the
``depth accumulation buffer''. This effectively means that any Z coordinates can
be stored in the depth accumulation buffer without scaling or range
remapping. I've made several
\href{https://az1.idk.st/public/20kdm2-demo.mp4}{moderately fancy} Dreamcast
demos that happily store arbitrary ``view space'' Z values in the depth
accumulation buffer without any visible depth aliasing/artifacts.

In contrast, the Radeon R500 does not have a 32-bit floating point Z-buffer
format. Instead, R500 supports (\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf}, page 283,
\texttt{ZB\_FORMAT}):

\begin{itemize}
\item 16-bit integer Z
\item 16-bit floating point
\item 24-bit integer Z with 8-bit stencil
\end{itemize}

The third option, with the most bits, clearly ought to give the most
precision--with the caveat that the Z values that are written to the Z-buffer
should be scaled to be uniformly distributed across the range of 24-bit integers.

I performed several tests with variations of
\href{https://git.idk.st/bilbo/r500/src/branch/main/drm/zbuffer_test.c}{zbuffer\_test.c}. The
general strategy was:

\begin{itemize}
\item Define some contrived/illustrative 3D scene
\item Manipulate the scale/range of Z and W values
\item Observe the state of the Z-buffer after rendering
\end{itemize}

The first scene I chose was of a tilted plane that is non-coplanar with the view
space XY plane, as in:

\begin{figure}
  \href{images/plane_scene.png}{\includegraphics{images/plane_scene.png}}
  \caption*{Blender screenshot, ``plane scene''}
\end{figure}

Where the grey plane is the object that is to be rendered, the yellow lines
represent a ``camera'' from which the plane is to be viewed, and the blue line
represents the view/clip-space Z axis.

To view the content of the Z buffer, I wrote a
\href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/tools/zbuf_decode.py}{simple script}
to convert the 24-bit integer Z-buffer to 16-bit
\href{https://en.wikipedia.org/wiki/Netpbm}{PGM},
so that it can be easily viewed in an image editor. This tool also shows the
minimum and maximum values found in the Z-buffer, intended to help verify that
the entire numeric range of the Z-buffer is being used.
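
The conversion can be sketched in Python as follows. This is not the linked
script; it is a minimal sketch that assumes a linear (untiled) buffer of
little-endian 32-bit words with the stencil in the low 8 bits and Z in the
upper 24 bits. The function name \texttt{z24s8\_to\_pgm16} is hypothetical.

\begin{minted}{python}
import struct

def z24s8_to_pgm16(raw, width, height):
    """Convert raw Z-buffer words to a 16-bit binary PGM.

    Returns (pgm_bytes, min_z, max_z), where min/max are 24-bit values."""
    # Assumption: little-endian 32-bit words, linear (untiled) layout
    words = struct.unpack(f"<{width * height}I", raw)
    # Assumption: stencil in the low 8 bits, Z in the upper 24 bits
    depths = [w >> 8 for w in words]
    # Keep the 16 most significant bits of each 24-bit depth value
    samples = [z >> 8 for z in depths]
    header = f"P5\n{width} {height}\n65535\n".encode("ascii")
    # PGM samples wider than 8 bits are stored most significant byte first
    body = struct.pack(f">{len(samples)}H", *samples)
    return header + body, min(depths), max(depths)

raw = struct.pack("<4I", 0x000000ff, 0x80000000, 0xffffff00, 0x12345678)
pgm, lowest, highest = z24s8_to_pgm16(raw, 2, 2)
print(lowest, highest)
\end{minted}

Dropping the low 8 bits of each depth value loses precision in the image, which
is why the minimum and maximum are reported from the full 24-bit values.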

While I expected to see the (orthographic, directly facing the camera) plane
drawn on the Z-buffer as a smooth gradient such as:

\begin{figure}
  \href{images/z_buffer_gradient.png}{\includegraphics{images/z_buffer_gradient.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_gradient.png}}
\end{figure}

Several of my tests displayed numeric aliasing, overflows, underflows, etc.:

\begin{figure}
  \href{images/z_buffer_overflow.png}{\includegraphics{images/z_buffer_overflow.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_overflow.png}}
\end{figure}

Of particular interest to me was to verify the behavior of the
\texttt{DX\_CLIP\_SPACE\_DEF} bit
(\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf}, page
255--this is also the only place in the entire manual where ``non-user'' clip
planes are even defined), and to understand the order of pipeline operations.

I played with moving the plane around, to observe clipping behavior (here the
lower half of the scene was clipped due to intersecting the Z=+1.0 clip plane):

\begin{figure}
  \href{images/z_buffer_clipped.png}{\includegraphics{images/z_buffer_clipped.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_clipped.png}\\
    (also simultaneously showing overflow/underflow artifacts)}
\end{figure}

Thinking at this point that I nearly understood most of the pieces, I then
re-enabled XY perspective division:

\begin{figure}
  \href{images/z_buffer_perspective.png}{\includegraphics{images/z_buffer_perspective.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_perspective.png}}
\end{figure}

The above image was not quite what I wanted: I noticed the range of the Z-buffer
values was roughly between \texttt{0} and \texttt{8388607}, but what I really
wanted was \texttt{0} to \texttt{16777215}. Adjusting scale again produced this
Z-buffer:

\begin{figure}
  \href{images/z_buffer_perspective_scale.png}{\includegraphics{images/z_buffer_perspective_scale.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_perspective\_scale.png}}
\end{figure}

Up to this point, I was using \texttt{ZFUNC=GREATER} with a Z-buffer cleared
with an initial depth of zero, where all Z values are negative numbers.

I decided it might be more intuitive to use a Z-buffer that is cleared with an
initial depth of one, using \texttt{ZFUNC=LESS} instead where all Z values are
positive numbers.

With these adjustments, I captured a Z-buffer from the earlier cube demo:

\begin{figure}
  \href{images/z_buffer_cube.png}{\includegraphics{images/z_buffer_cube.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_cube.png}}
\end{figure}

This was still not quite ``correct'', because the minimum depth of the cube is
being drawn as \textasciitilde{}\texttt{2763306} (\textasciitilde{}0.16), but I expected
something closer to zero.

Adjusting my range/scale arithmetic again produced this image:

\begin{figure}
  \href{images/z_buffer_cube_range.png}{\includegraphics{images/z_buffer_cube_range.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_cube\_range.png}}
\end{figure}

The minimum Z value now appears to be closer to zero, but the ``back'' faces of
the cube (and maximum Z values) are not visible. Without changing any
scale/range constants, inverting \texttt{ZFUNC} and using a zero-initialized
Z-buffer produced this image of the back faces of the cube:

\begin{figure}
  \href{images/z_buffer_cube_range_back.png}{\includegraphics{images/z_buffer_cube_range_back.png}}
  \caption*{R500 framebuffer capture, \texttt{z\_buffer\_cube\_range\_back.png}}
\end{figure}

Indeed, the maximum Z value is close to \textasciitilde{}\texttt{16777215}
(\textasciitilde{}1.0), as intended. I feel at this point I have a better intuition
for using integer Z-buffers. The pipeline (and relevant registers) appears to be
something like this:

\begin{figure}
  \includegraphics{diagrams/z_operations.svg}
  \caption*{R500 Z transform pipeline (simplified)}
\end{figure}

Prior to these experiments, I was not aware \texttt{SU\_DEPTH\_SCALE} is the
thing directly responsible for scaling floating point Z values to the integer Z
values stored in the depth buffer.
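
The arithmetic behind the integer values quoted in this section can be sketched
in Python. This is only a model of the scaling step: the truncating conversion
and the function names are assumptions, values outside the range 0 to 1 are not
modeled, and the exact behavior of \texttt{SU\_DEPTH\_SCALE} in hardware may
differ.

\begin{minted}{python}
Z_MAX_24 = (1 << 24) - 1  # 16777215, the largest 24-bit integer

def depth_to_int(z, scale=float(Z_MAX_24)):
    """Scale a floating point depth in [0, 1] to a 24-bit integer."""
    # Assumption: truncation toward zero; hardware rounding may differ
    return int(z * scale)

def int_to_depth(value):
    """Inverse mapping, used to interpret values read back from the Z-buffer."""
    return value / Z_MAX_24

print(depth_to_int(1.0))      # full range
print(depth_to_int(0.5))      # a scene that only reaches half of the range
print(int_to_depth(2763306))  # roughly 0.16
\end{minted}

Under this model, a Z range that only reaches \texttt{0.5} before scaling tops
out at \texttt{8388607}, which matches the half-range Z-buffer observed
earlier.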

In general, the hardware perspective divide, viewport transform, clipping, and
setup units are absolutely fascinating.

\subsection{3D perspective}

Despite making many 3D demos in the past, I feel that every time I want to
``draw something 3D'' on a new platform, I need to re-relearn 3D/perspective
transformations (perhaps because I never truly \textit{learned} anything).

In many OpenGL articles/tutorials/books the
\href{https://learnopengl.com/Getting-started/Coordinate-Systems}{standard}
\href{https://ogldev.org/www/tutorial12/tutorial12.html}{formula} for
\href{https://songho.ca/opengl/gl_projectionmatrix.html}{explaining}
\href{https://www.scratchapixel.com/lessons/3d-basic-rendering/perspective-and-orthographic-projection-matrix/opengl-perspective-projection-matrix.html}{perspective}
\href{https://learnwebgl.brown37.net/08_projections/projections_perspective.html}{projection}
appears to be:

\begin{itemize}
\item Begin with an overly-academic explanation of perspective in terms of camera optics and trigonometry
\item Do not implement or demonstrate any of the systems or mathematics
  described in the preceding pages of explanations; instead abruptly hide all
  magic behind \texttt{glm::perspective}
\item Refuse to explain or clarify further
\item Continue for the next 30 chapters/articles without ever revisiting focal
  length, view frustums, depth of field, etc. again
\end{itemize}

It is sufficient to instead rationalize/implement ``perspective'' as:

\begin{quote}
  Perspective is the division of X and Y coordinates by Z, where the coordinate
  $(0, 0, 0)$ is the view origin (and the center of the screen/projection).
\end{quote}
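
As a minimal Python sketch of this definition (the function name
\texttt{perspective\_divide} is hypothetical, and no field-of-view scale or
viewport transform is applied):

\begin{minted}{python}
def perspective_divide(x, y, z):
    """Project a view-space point by dividing X and Y by Z."""
    if z == 0:
        raise ValueError("cannot project a point with Z = 0")
    return (x / z, y / z)

# The same X/Y offset lands closer to the screen center as Z increases.
near = perspective_divide(1.0, 1.0, 2.0)
far = perspective_divide(1.0, 1.0, 8.0)
print(near, far)
\end{minted}

Points with larger Z are pulled toward $(0, 0)$, which is the whole effect:
distant objects appear smaller and closer to the center of the projection.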

Defining perspective this way also works for OpenGL, with some slight
adjustment, notably to deal with OpenGL's
\href{https://registry.khronos.org/OpenGL/specs/gl/glspec20.pdf}{definition} of
``normalized device coordinates''.

I note that (unlike Dreamcast) one can't actually divide by Z on R500 (nor
OpenGL), both because the VTE doesn't support this, and because the texture
unit doesn't support this. Of course, I tried it anyway:

\begin{figure}
  \includegraphics{videos/cube_warped_textures.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_warping.c} \\
    (unrelated to this demo, R500 also interestingly has a dedicated ``disable perspective-correct texture mapping'' bit)}
\end{figure}

Instead, in both cases, the R500 uses the W coordinate for division. This turns
out to be very convenient, because it means that the ``field of
view''/perspective scale (W) and the Z-buffer/depth test scale (Z) can be
adjusted independently.

\subsection{3D clipping}

Here are several examples of improperly scaled Z values, which are being clipped
by the setup unit:

\begin{figure}
  \includegraphics{videos/cube_clipped_far.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader\_optimize\_zscale.c} \\
  (``far'' clip plane intersection)}
\end{figure}

\begin{figure}
  \includegraphics{videos/cube_clipped_near.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader\_optimize\_zscale.c} \\
  (``near'' clip plane intersection)}
\end{figure}

\begin{figure}
  \includegraphics{videos/cube_clipped_near_opengl.png}
  \caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader\_optimize\_zscale.c} \\
  (I am curious to learn under what circumstances the OpenGL designers thought\\ $-w_{c} < z_{c} < w_{c}$ was a good idea)}
\end{figure}

\section{Progress: 31 Oct 2025}

From 30 Oct 2025 to 31 Oct 2025, I achieved the following (non-chronological):

\begin{itemize}
\item I implemented a \href{https://git.idk.st/bilbo/r500/src/branch/main/drm/matrix_cubesphere_specular.fs.asm}{diffuse/specular lighting fragment shader} in R500 fragment shader assembly
\item I made vertex shaders that represent coordinate space transformations
  using matrix multiplications rather than ad-hoc arithmetic
\item While writing demos that pass multiple (interpolated) vectors from the
  vertex shader to the fragment shader, I learned more about \href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular_suzanne.cpp#L444-L512}{``rasterizer instructions''}
\item I made a demo that uses more than one texture for the entire scene
  (by \href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/pumpkin_man.c#L272-L317}{reconfiguring
  the texture unit for each ``object''})
\end{itemize}

\subsection{Lighting demo}

\begin{figure}
  \includegraphics{videos/suzanne.png}
  \caption*{R500 DVI capture, \texttt{matrix\_cubesphere\_specular\_suzanne.cpp} \\
  (subdivided Suzanne mesh, 15,744 triangles)}
\end{figure}

Despite being a ``simple'' lighting demo, a surprising number of things need to
happen simultaneously before it becomes possible.

Where vertex shaders from previous demos were passed at most a single scalar
variable for animation/timing, the vertex shader in this demo uses
\href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular_suzanne.cpp#L301-L326}{10 vectors} as
input:

\begin{itemize}
\item 4 vectors for a ``local space to clip space'' transformation matrix
\item 4 vectors for a ``local space to world space'' transformation matrix (used for lighting)
\item 1 vector for a ``light position'' (in world space coordinates, used for lighting)
\item 1 vector for a ``view origin'' (in world space coordinates, used for lighting)
\end{itemize}

Additionally, where previous demos passed at most a single vector from the
vertex shader to the fragment shader (vertex color or texture coordinates), this
demo passes
\href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular_suzanne.cpp#L444-L512}{5 vectors}
from the vertex shader to the fragment shader, all of which are used
by the lighting calculation:

\begin{itemize}
\item world space position
\item world space normal
\item world space light position
\item world space view origin
\item uv space texture coordinates
\end{itemize}

\subsection{Learn algebra by writing fragment shader assembly}

Prior to today, I did not know about this transformation/equivalence:

\begin{gather*}
x^{n} = 2^{\left( n\cdot\frac{\log(x)}{\log(2)} \right)}
\end{gather*}

While the R500 fragment shader alpha unit does not have a \texttt{POW} operation,
it does have \href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular.fs.asm#L93-L99}{\texttt{EX2} and \texttt{LN2}}
operations.

For example, one could implement $a^{32}$ in R500 fragment shader assembly as:

\begin{figure}
  \href{verbatim/pow_fragment_shader.fs.asm}{\includegraphics{verbatim/output/pow_fragment_shader.fs.asm.pdf}}
\end{figure}
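
The identity is easy to check numerically. The following Python sketch mirrors
the ``base-2 logarithm, multiply, base-2 exponent'' sequence; the function name
\texttt{pow\_via\_exp2} is hypothetical, and the identity only holds for
positive bases.

\begin{minted}{python}
import math

def pow_via_exp2(x, n):
    """Compute x**n as 2**(n * log2(x)); only valid for x > 0."""
    return 2.0 ** (n * math.log2(x))

a = 0.9
print(pow_via_exp2(a, 32), a ** 32)
\end{minted}

The two printed values agree up to floating point rounding.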

This ``arbitrary exponents with arbitrary bases'' pattern is used in the
lighting demo fragment shader as part of the ``specular intensity'' calculation.

This fragment shader unit feature is very cool, because a software
implementation of a generalized floating-point \texttt{pow} function is
extremely
\href{https://git.musl-libc.org/cgit/musl/tree/src/math/powf.c?id=cb5c057c87240a9534f8e0d9b7ff2560082f6218}{computationally expensive}
otherwise.

\section{Progress: 11 Nov 2025}

From 1 Nov 2025 to 11 Nov 2025, I achieved the following:

\begin{itemize}
\item I briefly experimented with \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/matrix_cubesphere_cubemap.cpp#L1081-L1088}{cubemap} \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/matrix_cubesphere_cubemap.cpp#L503}{textures}
\item I experimented with point primitives and texture coordinate ``stuffing''
\item I made a demo that generates and uses \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/matrix_cubesphere_tiled.cpp}{macrotiled/microtiled textures}
\item I created a particle system demo where the \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/particle_physics.fs.asm}{particle simulation is computed in a fragment shader}
\item I implemented \href{https://git.idk.st/bilbo/r500/commit/9e281cba583ec4a06e02470310c31cdad6962f64}{support for \texttt{\#include} directives} in my vertex and fragment shader assemblers
\item I used the new \texttt{\#include} feature to more concisely express an \href{https://git.idk.st/bilbo/r500/commit/fdff78f1ad/main/drm/shadertoy_palette_fractal.fs.asm#L26-L29}{unrolled loop}
\end{itemize}

\subsection{Rewriting GLSL ``shadertoy'' shaders as R500 assembly}

I felt \href{https://www.shadertoy.com/view/mtyGWy}{``Shader Art Coding Introduction''}
would be fun to reimplement as R500 fragment shader assembly: it produces
an interesting visual effect, despite not being particularly complicated.

In general when writing assembly programs, I use a few techniques to improve my
productivity and accuracy, prior to writing any actual assembly:

\subsubsection{Expand/rewrite the GLSL code}

In particular, my goal in this step is to make each line of GLSL roughly equal
to one fragment shader instruction. I started by rewriting all of the function
calls the \href{https://www.shadertoy.com/view/mtyGWy}{original} code made as:

\begin{figure}
  \href{verbatim/palette_fractal_functions.fs.glsl}{\includegraphics{verbatim/output/palette_fractal_functions.fs.glsl.pdf}}
\end{figure}

Defining replacements for GLSL's built-in functions is not a good practice for
writing GLSL code in general. However, the goal in this specific situation is to
give myself line-by-line hints on the R500 fragment shader assembly that I'll
eventually need to write.

I then rewrote the \texttt{main} function in a similar ``one line of GLSL per
R500 instruction'' style as:

\begin{figure}
  \href{verbatim/palette_fractal_main.fs.glsl}{\includegraphics{verbatim/output/palette_fractal_main.fs.glsl.pdf}}
\end{figure}

I also decided that the multiplication by 0.125, which would otherwise have
required a separate multiply-add instruction, was a perfect excuse to
\href{https://git.idk.st/bilbo/r500/commit/90b486e744c14bb23283218108799186162afaad}{implement assembler support}
for the ``OMOD'' R500 fragment shader feature/syntax that I previously
\href{https://git.idk.st/bilbo/r500/commit/8e6e6e9750a33759b51ed73d3e238ebe77ee3f61}{implemented in my fragment shader disassembler}.
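
As a rough illustration of why an output modifier saves an instruction, the
following C sketch models a multiply-add whose result is scaled by a fixed
factor before it is written. The \texttt{omod} enumeration, its member names,
the assumed set of power-of-two factors, and both functions are hypothetical:
they only model the arithmetic, and are neither the R500 register encoding nor
my assembler's syntax.

\begin{minted}{c}
#include <assert.h>

/* Hypothetical model of an output modifier: a fixed power-of-two scale
   applied to the ALU result. Names and the set of factors are assumptions. */
enum omod {
  OMOD_IDENTITY, OMOD_MUL2, OMOD_MUL4, OMOD_MUL8,
  OMOD_DIV2, OMOD_DIV4, OMOD_DIV8, OMOD_COUNT
};

static const float omod_scale[OMOD_COUNT] = {
  1.0f, 2.0f, 4.0f, 8.0f, 0.5f, 0.25f, 0.125f,
};

/* One instruction: (a * b + c), scaled by the output modifier. */
float mad_omod(float a, float b, float c, enum omod omod)
{
  assert((unsigned)omod < OMOD_COUNT);
  return (a * b + c) * omod_scale[omod];
}

/* Without an output modifier, the same result needs a second multiply-add. */
float mad_then_scale(float a, float b, float c)
{
  float t = a * b + c;        /* first instruction */
  return t * 0.125f + 0.0f;   /* second instruction */
}
\end{minted}

In this model, \texttt{mad\_omod(a, b, c, OMOD\_DIV8)} and
\texttt{mad\_then\_scale(a, b, c)} compute the same value, but the former
corresponds to a single instruction.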

\subsubsection{Assign all temporary variables to registers}

Still prior to writing any fragment shader assembly, I then decided where I
would store each GLSL variable in fragment shader temporary/constant memory:

\begin{figure}
  \href{verbatim/palette_fractal_memory.fs.asm}{\includegraphics{verbatim/output/palette_fractal_memory.fs.asm.pdf}}
\end{figure}

I intentionally stored scalar values in the alpha component of each
vector. Given my current fragment shader assembler syntax, this makes the code
slightly more readable. For example, doing a scalar multiply-add with the alpha
unit looks like this:

\begin{figure}
  \href{verbatim/palette_fractal_alpha_mad.fs.asm}{\includegraphics{verbatim/output/palette_fractal_alpha_mad.fs.asm.pdf}}
\end{figure}

If the \texttt{l} variable were instead stored in the green component, the code
would be slightly uglier, as in:

\begin{figure}
  \href{verbatim/palette_fractal_rgb_mad.fs.asm}{\includegraphics{verbatim/output/palette_fractal_rgb_mad.fs.asm.pdf}}
\end{figure}

\subsubsection{Translate the GLSL to R500 fragment shader assembly, line-by-line}

Because the GLSL code was transformed to closely match the fragment shader
assembly, it is also easy to test the fragment shader output when only a
fraction of the complete program is translated (e.g., by commenting out chunks
of the GLSL code to match the current state of the in-progress fragment shader
assembly translation).

The visual appearance of a half-translated variant of this fragment shader is
not intuitive, so this technique greatly improves debuggability. I made at
least two mistakes while translating that were not difficult to debug at a
per-instruction level by comparing against the equivalent GLSL code's visuals.

\subsubsection{Translate a fixed-length GLSL loop}

Though the R500 does support it, my fragment shader assembler does not currently
implement support for loops (or any other type of flow control).

R500 fragment shader flow control is also relatively expensive compared to
``loop unrolling'', particularly in this case where the loop body is only 32
instructions, and there are only 4 total iterations of the loop body.

For this reason, I decided I wanted a concise and generalized way to ``repeat''
chunks of source code in my fragment shader assembler, without actually
duplicating the text.

To do this, I
\href{https://git.idk.st/bilbo/r500/commit/9e281cba583ec4a06e02470310c31cdad6962f64}{implemented}
an ``\texttt{\#include}'' feature in my fragment shader assembler. This is
conceptually similar to how \texttt{\#include} works in the C programming
language, though my implementation simply feeds tokens from the included file
directly from the (nested) lexer to the parser, rather than the much more
complex procedure used by the C preprocessor.
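
The idea of a nested lexer feeding one token stream to the parser can be
sketched in C as a small stack of read cursors. This is a simplified
illustration and not my assembler's actual code: the names
\texttt{lexer\_stack}, \texttt{lexer\_open}, and \texttt{next\_token} are
hypothetical, ``files'' are in-memory strings, tokens are plain
whitespace-delimited words, and the included name is unquoted.

\begin{minted}{c}
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 8

/* Hypothetical in-memory "file system": name -> source text. */
struct source { const char *name; const char *text; };

/* A stack of read cursors, one per currently-open (nested) source. */
struct lexer_stack {
  const struct source *sources;
  size_t source_count;
  const char *cursor[MAX_DEPTH];
  int depth;
};

/* Push the named source; returns 0 if it is unknown or nested too deeply. */
int lexer_open(struct lexer_stack *ls, const char *name)
{
  if (ls->depth >= MAX_DEPTH) return 0;
  for (size_t i = 0; i < ls->source_count; i++) {
    if (strcmp(ls->sources[i].name, name) == 0) {
      ls->cursor[ls->depth++] = ls->sources[i].text;
      return 1;
    }
  }
  return 0;
}

/* Read one word from the innermost source; returns 0 at its end. */
static int read_word(struct lexer_stack *ls, char *out, size_t size)
{
  assert(size > 0 && ls->depth > 0);
  const char *p = ls->cursor[ls->depth - 1];
  size_t n = 0;
  while (isspace((unsigned char)*p)) p++;
  while (*p != '\0' && !isspace((unsigned char)*p)) {
    if (n + 1 < size) out[n++] = *p;
    p++;
  }
  out[n] = '\0';
  ls->cursor[ls->depth - 1] = p;
  return n > 0;
}

/* Next token as seen by the parser: 1 = token in out, 0 = end of input,
   -1 = bad #include. The parser never sees the #include itself. */
int next_token(struct lexer_stack *ls, char *out, size_t size)
{
  while (ls->depth > 0) {
    if (!read_word(ls, out, size)) {
      ls->depth--;   /* end of this source: resume the including one */
      continue;
    }
    if (strcmp(out, "#include") != 0)
      return 1;
    /* the word after #include names the source to push */
    if (!read_word(ls, out, size) || !lexer_open(ls, out))
      return -1;
  }
  return 0;
}
\end{minted}

Including the same source several times in a row yields the same tokens several
times, which is the property that makes this usable for loop unrolling.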

With this new feature, the translation of the GLSL loop is very simple:

\begin{figure}
  \href{verbatim/palette_fractal_loop.fs.asm}{\includegraphics{verbatim/output/palette_fractal_loop.fs.asm.pdf}}
\end{figure}

The full implementation is committed as
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/shadertoy_palette_fractal.fs.asm}{shadertoy\_palette\_fractal.fs.asm} and
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/shadertoy_palette_fractal_loop_inner.fs.asm}{shadertoy\_palette\_fractal\_loop\_inner.fs.asm}.

\subsubsection{Demo videos}

\begin{figure}
  \includegraphics{videos/shadertoy_fractal.png}
  \caption*{R500 DVI capture, \texttt{shadertoy\_palette\_fractal.fs.asm}\\(variant)}
\end{figure}

\begin{figure}
  \includegraphics{videos/shadertoy_fractal2.png}
  \caption*{R500 DVI capture, \texttt{shadertoy\_palette\_fractal.fs.asm}}
\end{figure}

\subsection{Fragment shader particle simulation}

\subsubsection{Using fragment shaders to render non-pixel data}

ATI documentation \href{doc/R2VB_programming.pdf}{mentioned} the existence of a
``Render to Vertex Buffer'' feature.

The general idea/revelation is:

\begin{itemize}
\item fragment shader output does not need to be ``pixel data''; it can be
  assigned any desired meaning
\item by alternating between a pair of buffers, fragment shader output can be
  used as the input for the next invocation of the same fragment shader
\end{itemize}

The state manipulated by the pixel shader is double-buffered, where each
iteration of the fragment shader uses alternating ``read'' and ``write''
buffers, as in:

\begin{figure}
  \href{diagrams/simplified_particle_data_flow.svg}{\includegraphics{diagrams/simplified_particle_data_flow.svg}}
\end{figure}

On the subsequent iteration of the same computation, state ``b'' would be read
and state ``a'' would be written.

For all prior fragment shader demos, I used the 32-bit \texttt{C4\_8} surface
format:

\begin{figure}
  \href{diagrams/c4_8.pdf}{\includegraphics{diagrams/c4_8.pdf}}
\end{figure}

Where 8-bit unsigned integer representations of blue, green, red, and alpha
could be stored in C0, C1, C2, and C3 respectively (or any other arbitrary
color component ordering).

R500 also supports a 128-bit \texttt{C4\_32\_FP} surface format:

\begin{figure}
  \href{diagrams/c4_32_fp.pdf}{\includegraphics{diagrams/c4_32_fp.pdf}}
\end{figure}

Where each component contains a 32-bit floating point value. Compared to 8-bit
integers, this increase in precision makes the format more useful for
generalized computation.

R500 conveniently also has an equivalent 128-bit per texel, 32-bit floating
point per component, 4-component texture format.
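
To give a feel for the precision difference, this C sketch stores a value in a
single 8-bit component and reads it back. It assumes the 8-bit integer is
interpreted as a normalized value in the range 0.0 to 1.0; that interpretation
and the function name \texttt{unorm8\_roundtrip} are my assumptions for
illustration.

\begin{minted}{c}
/* Store a value as an 8-bit unsigned normalized integer (assumed
   interpretation of one C4_8 component), then read it back as a float. */
float unorm8_roundtrip(float x)
{
  if (x < 0.0f) x = 0.0f;   /* values outside [0, 1] can't be represented */
  if (x > 1.0f) x = 1.0f;
  unsigned char stored = (unsigned char)(x * 255.0f + 0.5f);
  return (float)stored / 255.0f;
}
\end{minted}

Under this assumption, a value can be off by up to $\nicefrac{1}{510}$ after
one round trip, and anything outside the 0.0 to 1.0 range is lost entirely. A
32-bit floating point component has neither restriction.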

\subsubsection{Particle simulation data model}

I decided a minimal but still ``mildly interesting'' particle system would need
at least the following state:

\begin{figure}
  \href{verbatim/particle_system_data_model.c}{\includegraphics{verbatim/output/particle_system_data_model.c.pdf}}
\end{figure}

\texttt{age} is used to both ``reset'' the particle after some time (allowing
the simulation to repeat indefinitely) and to give the particles non-uniform
reset timing. \texttt{random} is used to further make the behavior of each
particle less uniform. At the start of the particle simulation, all values are
randomly initialized.

This data model requires 8 components in total, which is more than the 4
components provided by both the pixel shader output surface format and the
texture sampler texel format. However:

\begin{itemize}
\item R500 fragment shaders can have up to 4 independent render targets
\item R500 fragment shaders can sample from up to 16 independent textures
\end{itemize}

Following this model, it makes sense to break up the data structure like this:

\begin{figure}
  \href{verbatim/particle_system_data_model_split.c}{\includegraphics{verbatim/output/particle_system_data_model_split.c.pdf}}
\end{figure}

Where each fragment samples from two separate texture buffers, and has two
separate render targets as output:

\begin{figure}
  \href{diagrams/simplified_particle_data_flow_split.svg}{\includegraphics{diagrams/simplified_particle_data_flow_split.svg}}
\end{figure}

\subsubsection{Drawing particles}

I decided to draw particles using the R500's ``quad list'' primitive. In a
non-fragment-shader-computed version of my particle simulation demo, I sent the
particle position as a vertex shader constant, as in:

\begin{figure}
  \href{verbatim/particle_system_cpu.cpp}{\includegraphics{verbatim/output/particle_system_cpu.cpp.pdf}}
\end{figure}

The vertex shader is then able to calculate the quad vertex positions using a
vertex shader program that is equivalent to this GLSL code:

\begin{figure}
  \href{verbatim/particle_system_position.glsl}{\includegraphics{verbatim/output/particle_system_position.glsl.pdf}}
\end{figure}

This works reasonably well for small particle system demos where particle
position is calculated on the CPU. However, the goal is to compute (much larger)
particle system positions via the pixel shader, and I would strongly prefer
that the particle system state never leaves R500 VRAM. In the latter case, the
``combine quad position coordinates with particle position coordinates via
vertex shader constants'' approach does not work for several reasons:

\begin{itemize}
  \item R500 constant memory has 256 vectors; I'd like to make particle systems
    with at least 100,000 particles.
  \item The R500 pixel shader is not able to write to vertex shader constant
    memory; it can only write to texture memory.
  \item The \texttt{radeon} Linux kernel module generates a segmentation fault
    in kernel space when given an indirect buffer larger than approximately 2MB
    (a Linux kernel bug, not an R500 hardware limitation). Including the
    overhead of multiple \texttt{3D\_DRAW\_IMMD} commands and vertex constant
    transfers, 100,000 particles easily require more than 2MB of indirect
    buffer.
\end{itemize}
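
A back-of-the-envelope estimate for the indirect buffer size in the last point
can be written as a small C function. The per-particle layout here is an
assumption for illustration (one 4-component position constant plus four
immediate-mode vertices); the function name and the example vertex size of 2
dwords are hypothetical, and the real packet overhead is left as a parameter.

\begin{minted}{c}
#include <stddef.h>

/* Rough lower bound on indirect buffer bytes when every particle is drawn
   as its own quad with its own vertex shader constant. The layout is an
   assumption, not the exact command stream. */
size_t indirect_buffer_bytes(size_t particles,
                             size_t dwords_per_vertex,
                             size_t overhead_dwords_per_particle)
{
  const size_t bytes_per_dword = 4;
  const size_t vertices_per_quad = 4;
  const size_t constant_dwords = 4;   /* one 4-component position constant */
  size_t dwords_per_particle = constant_dwords
    + vertices_per_quad * dwords_per_vertex
    + overhead_dwords_per_particle;
  return particles * dwords_per_particle * bytes_per_dword;
}
\end{minted}

With 100,000 particles, 2 dwords per vertex, and no packet overhead at all,
this comes to 4,800,000 bytes, already more than twice the 2MB limit.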

The only remaining option is to store particle position coordinates as a vertex
buffer. However, I am drawing quads, and the vertex shader only operates on
individual vertices. Despite the R500 vertex fetcher's generous flexibility,
the particle state buffer, which has one entry per particle rather than one
entry per vertex, can't be used directly by the vertex shader.

For example, if a particle is at position \texttt{(4.0, 5.0, 6.0)}, the data
that needs to be sent to the vertex shader should be:

\begin{figure}
  \href{verbatim/particle_position_vertex_shader_example.c}{\includegraphics{verbatim/output/particle_position_vertex_shader_example.c.pdf}}
\end{figure}

Or, \textit{expressed as a texture}, the desired transformation is:

\begin{figure}
  \href{diagrams/texture_grid.pdf}{\includegraphics{diagrams/texture_grid.pdf}}
\end{figure}

While the R500 pixel shader unit can't itself perform this transformation
directly, the transformation can indeed be achieved by ``scaling'' the particle
state via the R500 setup engine and fragment interpolators using point texture
sampling.

Doing this via point texture sampling is absolutely critical, because a linear
interpolation between the state of two adjacent-in-memory particles is a
completely meaningless operation in this context.

This is implemented in
\href{https://git.idk.st/bilbo/r500/src/branch/main/src/particle_oriented_animated_quad_vbuf_pixel_shader.cpp#L583}{\texttt{\_copy\_to\_vertexbuffer}}
as simply ``rendering'' the particle positions into a viewport that is 4x wider
than the width of the original particle state texture.
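
The effect of that ``render'' can be modeled on the CPU as a minimal C sketch.
It assumes the expanded buffer holds each particle's texel four times in a row,
once per quad vertex; \texttt{vec4} and \texttt{expand\_for\_quads} are
hypothetical names, and on the R500 the same result comes from the
interpolators and point sampling rather than from a loop.

\begin{minted}{c}
#include <assert.h>

struct vec4 { float x, y, z, w; };

/* CPU model of "rendering" a width x height particle state texture into a
   target that is 4x wider using point (nearest) sampling: every source
   texel is repeated 4 times, once per quad vertex. */
void expand_for_quads(const struct vec4 *src, struct vec4 *dst,
                      int width, int height)
{
  assert(width > 0 && height > 0);
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width * 4; x++) {
      /* nearest source texel; no blending between adjacent particles */
      dst[y * width * 4 + x] = src[y * width + x / 4];
    }
  }
}
\end{minted}

With linear filtering, the texels between two source positions would instead
contain blended values that belong to neither particle.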

\subsubsection{The complete rendering pipeline}

All buffers in the following diagram are entirely stored in R500 texture memory,
and are never transferred to x86 RAM.

\begin{figure}
  \href{diagrams/complete_particle_data_flow.svg}{\includegraphics{diagrams/complete_particle_data_flow.svg}}
\end{figure}

The full rendering pipeline implementation is committed as
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/particle_oriented_animated_quad_vbuf_pixel_shader.cpp}{particle\_oriented\_animated\_quad\_vbuf\_pixel\_shader.cpp}.

The full particle simulation pixel shader implementation is committed as
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/particle_physics.fs.asm}{particle\_physics.fs.asm}.

\subsubsection{Demo video}

Speed comparison of my test system's Pentium 4 CPU and R500 pixel shader
simulating the same particle system (131,072 particles):

\begin{figure}
  \includegraphics{videos/cpu_particle_simulation.png}
  \caption*{R500 DVI capture, \texttt{particle\_oriented\_animated\_quad\_vbuf.cpp}\\(CPU-generated particle system)}
\end{figure}

\begin{figure}
  \includegraphics{videos/pixel_shader_particle_simulation.png}
  \caption*{R500 DVI capture, \texttt{particle\_oriented\_animated\_quad\_vbuf\_pixel\_shader.cpp}\\(pixel-shader-generated particle system)}
\end{figure}

A more colorful variant of the same particle system demo (65,536 particles):

\begin{figure}
  \includegraphics{videos/pixel_shader_particle_simulation_color.png}
  \caption*{R500 DVI capture, \texttt{particle\_oriented\_animated\_quad\_vbuf\_pixel\_shader.cpp}\\(pixel-shader-generated particle system)}
\end{figure}

It is exciting for me to realize that this ``perform generalized computations
via R500 pixel shaders'' technique has myriad other possible applications.

\end{document}