\documentclass[20pt]{article}

\usepackage{amsmath}
\usepackage[font=small,labelfont=bf]{caption}
\usepackage{hyperref}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=cyan,
pdftitle={Dreamcast},
pdfpagemode=FullScreen,
}

\usepackage{graphicx}
\graphicspath{ {./images/} }

\usepackage{minted}
\usepackage{nicefrac}

\title{Radeon R500}
\date{}

\begin{document}

\maketitle
\href{images/x1950xt.jpg}{\includegraphics{images/x1950xt.jpg}}

\tableofcontents
|
|
|
|
\section{Introduction}
|
|
|
|
The primary/minimal project goal is ``draw a triangle on a Radeon R500 via
|
|
direct memory-mapped hardware register and texture memory accesses''. This means
|
|
no \href{https://mesa3d.org/}{Mesa}, no
|
|
\href{https://github.com/torvalds/linux/tree/v6.12/drivers/gpu/drm/radeon}{radeon}
|
|
kernel module, and certainly no OpenGL or Direct3D.
|
|
|
|
I have worked directly with several other graphics units in the past
|
|
(\href{https://github.com/buhman/saturn-examples}{Saturn VDP1},
|
|
\href{https://github.com/buhman/dreamcast}{Dreamcast Holly},
|
|
\href{https://github.com/buhman/voodoo}{Voodoo 2}). In all of these projects,
|
|
my strategy is generally:
|
|
|
|
\begin{itemize}
|
|
\item read the entire \href{doc/R5xx_Acceleration_v1.5.pdf}{reference
|
|
documentation} at least once, front-to-back
|
|
\item copy all hardware register definitions from the documentation to a
|
|
spreadsheet or text file (sometimes typing everything by hand if I am in such
|
|
a chill mood)
|
|
\item progressively build increasingly-complex example programs that exercise
|
|
the hardware
|
|
\end{itemize}
|
|
|
|
The rabbit hole for R500 seems significantly deeper, considering this is the
|
|
first graphics unit I've worked with that has programmable vertex and pixel
|
|
shader engines.
|
|
|
|
\subsection{Hardware}
|
|
|
|
For testing, I currently have this hardware configuration:
|
|
|
|
\begin{itemize}
|
|
\item ASUS P4B-LX (Intel 845) motherboard
|
|
\item Intel Pentium 4 2.6GHz SL6PP (Northwood)
|
|
\item 1024 MB RAM
|
|
\item 32GB PATA SSD
|
|
\item ATI Radeon X1650 PRO 512MB AGP
|
|
\end{itemize}
|
|
|
|
I also have the X1950 XT PCIe shown in the photo, which amazingly has never been
|
|
used, and prior to the photo was sealed in an antistatic bag from manufacture to
|
|
now.
|
|
|
|
\subsection{Test setup}
|
|
|
|
While in my other (video game console) projects I typically insist on
|
|
``bare-metal'' development with no operating system or third-party library
|
|
running on the target hardware, my experience with x86 is much more limited.
|
|
|
|
While it is something I am interested in doing, I believe creating a
zero-dependency ``code upload'' mechanism for an x86 PC that does not depend on
an operating system would severely delay my progress on R500-specific work.
|
|
|
|
For my initial exploration of R500, I will instead be manipulating the hardware
|
|
primarily from Linux kernel space. This Linux kernel code does not actually
|
|
meaningfully depend on Linux APIs beyond calling \texttt{ioremap} to get usable
|
|
memory mappings for R500 PCI resources (texture/framebuffer memory and
|
|
registers).
|
|
|
|
\section{Progress: 07 Oct 2025}
|
|
|
|
From 01 Oct 2025 to 07 Oct 2025, I achieved the following:
|
|
|
|
\begin{itemize}
|
|
\item I wrote a reasonably complete AtomBIOS disassembler
|
|
\item I can disable (IBM PC) VGA mode and manipulate the native framebuffer
|
|
\item I can upload microcode to the ``command processor'', and I can write to
scratch registers via command processor packets (not coincidentally, this is the
same command processor test that the radeon kernel module performs).
|
|
\item I stepped through Mesa functions as invoked by a simple OpenGL
|
|
application, and created \href{mesa/glDrawArrays.txt}{a list of R500
|
|
registers/values} that are written by Mesa during \texttt{glDrawArrays}.
|
|
\end{itemize}
|
|
|
|
I did not achieve the following:
|
|
|
|
\begin{itemize}
|
|
\item I attempted to manipulate the R500 register state and command processor
|
|
into drawing a triangle, but I have not been successful yet
|
|
\end{itemize}
|
|
|
|
\subsection{Documentation}
|
|
|
|
In general, I note that the R500 documentation is significantly weaker than I
|
|
hoped, and does not contain enough information to draw a triangle on the R500
|
|
from the documentation alone (with no prior knowledge about previous Radeon
|
|
graphics units).
|
|
|
|
In addition to the lack of prose, in several cases I've noticed both Mesa and
|
|
Linux reference R500 registers that are
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/pci/undocumented_3d_registers.h}{not
|
|
present at all} in the documentation.
|
|
|
|
\subsection{AtomBIOS}
|
|
|
|
AtomBIOS physically exists as a section inside the ROM on R500 graphics units.
|
|
AtomBIOS is notably used for setting PLL/pixel clock frequencies and display
|
|
resolutions, among several other functions.
|
|
|
|
The Radeon graphics hardware itself does not execute AtomBIOS code--instead, the
host (e.g., x86) CPU is expected to evaluate the instructions in the AtomBIOS
command tables. Generally, the outcome of evaluating AtomBIOS code is that
several ``register write'' instructions are executed, changing the state of the
graphics unit.
|
|
|
|
My original goal in studying AtomBIOS was that I thought I would need it to set
|
|
up the R500 display controller to a reasonable state (as a prerequisite for
|
|
drawing 3D graphics). However, after actually experimenting with ``disable VGA
|
|
mode'', I currently believe that I don't actually need to implement
|
|
resolution/mode changes, and can proceed without it.
|
|
|
|
\subsection{PIO mode}
|
|
|
|
The Linux kernel exclusively communicates with R500 via ``PCI bus mastering''.
|
|
A ``ring buffer'' is allocated in ``GTT'' space, which from the graphics unit's
|
|
perspective exists in the same address space as framebuffer memory, but is an
|
|
address that is outside the framebuffer memory that physically exists.
|
|
|
|
I also observed via debugfs that the GTT apparently involves some sort of sparse
|
|
page mapping, but I don't understand how this works from an x86 perspective.
|
|
|
|
In the absence of an understanding of how to make my own ``GTT'' address space,
|
|
I attempted to operate the R500 in ``PIO'' mode. This has the advantage of being
|
|
able to simply write to registers via (simple) PCI memory-mapped accesses, but
|
|
it has the disadvantage that Linux doesn't use R500 this way, so I have no
|
|
reference implementation for how PIO mode should be used.
|
|
|
|
\subsection{Triangle drawing attempt \#1}
|
|
|
|
I translated my \href{mesa/glDrawArrays.txt}{glDrawArrays notes} to
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/b6472e4c16946f44e02d82f31adaa411df009c67/pci/triangle.c}{equivalent
|
|
register writes}.
|
|
|
|
This does not work, and I don't yet understand why. The main issue is that most
|
|
of the time when I execute that code, Linux appears to ``hang'' completely, and
|
|
my ``printk'' messages are never sent over ssh. On the rare occasion when the
|
|
``hang'' does not occur, a triangle is nevertheless not drawn on the
|
|
framebuffer.
|
|
|
|
I have a few ideas for how to proceed:
|
|
|
|
\begin{itemize}
|
|
\item Move the ``triangle.c'' register accesses to userspace via
|
|
\texttt{/sys/bus/pci}, which might improve debuggability
|
|
\item Abandon the ``write a kernel module'' idea completely, and instead
|
|
interact with the R500 via \href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_drv.c#L565-L577}{radeon DRM ioctls}
|
|
\end{itemize}
|
|
|
|
The latter is perhaps both the more attractive option and the more work. I
currently don't have any understanding of GEM buffers, radeon buffer objects,
etc., so I'd need to study these in more detail.
|
|
|
|
\section{Progress: 14 Oct 2025}
|
|
|
|
From 08 Oct 2025 to 14 Oct 2025, I achieved the following:
|
|
|
|
\begin{itemize}
|
|
\item I studied how Mesa interacts with the \texttt{radeon} kernel module via
|
|
\texttt{DRM\_RADEON\_} ioctls.
|
|
\item I wrote simple R500 \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/pvs_disassemble.py}{vertex shader} and \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/us_disassemble.py}{pixel shader} disassemblers.
|
|
\item I wrote a \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/parse_packets.py}{tool} to print R500 ``PM4'' packets in human-readable form.
|
|
\item I laboriously \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs/bits}{copied and reformatted} all bit definitions from \href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf}
|
|
\item I wrote \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/regs}{several other miscellaneous tools} related to register and bit parsing and manipulation.
|
|
\item I wrote two \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/single_color.c}{humble} \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/vertex_color.c}{demos} to draw a triangle on R500.
|
|
\end{itemize}
|
|
|
|
\subsection{Radeon DRM}
|
|
|
|
As implied in the last update, primarily due to my lack of experience with
|
|
bare-metal x86, I decided it would be a better approach to interact with R500
|
|
Command Processor via the \texttt{radeon} kernel module, which provides a
|
|
partially reasonable interface for this via the \texttt{DRM\_RADEON\_CS} ioctl.
|
|
|
|
All \texttt{DRM\_RADEON\_} ioctls are mostly or entirely undocumented. Instead,
|
|
I built debugging symbols for Mesa and other supporting libraries so that I
|
|
could set breakpoints in GDB to observe what sequences of \texttt{DRM\_RADEON\_}
|
|
ioctls Mesa uses.
|
|
|
|
From my previous \href{mesa/glDrawArrays.txt}{glDrawArrays notes} observations,
|
|
I noticed this strange sequence:
|
|
|
|
\begin{verbatim}
0x0000138a // type 0 packet, count=0, starting offset = RB3D_COLOROFFSET0
0x00000000 // RB3D_COLOROFFSET0 = 0
0xc0001000 // type 3 packet, count=0, opcode=NOP
0x00000000 // zero (meaningless data)
\end{verbatim}
|
|
|
|
At first, it seemed Mesa was deliberately setting the colorbuffer write address
|
|
to (VRAM address) zero, which seemed like a strange choice considering I am
|
|
debugging an X11/GLX OpenGL application--surely the colorbuffer address would be
|
|
some non-zero value several megabytes after the beginning of VRAM.
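
The header dwords above can be picked apart mechanically. The following is a
minimal sketch, assuming a PM4 header layout with the packet type in bits 31:30,
the count in bits 29:16, a dword register index in bits 12:0 for type 0 packets,
and an opcode in bits 15:8 for type 3 packets. \texttt{pm4\_decode} and
\texttt{struct pm4\_header} are hypothetical names used only for illustration:

\begin{verbatim}
#include <stdint.h>

/* Decoded view of one PM4 packet header (hypothetical helper type). */
struct pm4_header {
    uint32_t type;    /* bits 31:30: packet type */
    uint32_t count;   /* bits 29:16: count field, as written in the header */
    uint32_t reg;     /* type 0 only: byte offset of the first register */
    uint32_t opcode;  /* type 3 only: bits 15:8 */
};

/* Split a header dword into its fields, under the layout assumed above. */
struct pm4_header pm4_decode(uint32_t header)
{
    struct pm4_header p = { 0, 0, 0, 0 };

    p.type  = header >> 30;
    p.count = (header >> 16) & 0x3fff;
    if (p.type == 0)
        p.reg = (header & 0x1fff) << 2;  /* dword index -> byte offset */
    else if (p.type == 3)
        p.opcode = (header >> 8) & 0xff;
    return p;
}
\end{verbatim}

Under these assumptions, \texttt{0x0000138a} decodes to a type 0 write starting
at register offset \texttt{0x4e28}, and \texttt{0xc0001000} decodes to a type 3
packet with opcode \texttt{0x10}, which agrees with the annotations above.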
|
|
|
|
I later attempted to send my own PM4 packet via \texttt{DRM\_RADEON\_CS}. This
|
|
initial attempt returned \texttt{Invalid argument}, with the following
|
|
message in dmesg:
|
|
|
|
\begin{verbatim}
[ 1205.978993] [drm:radeon_cs_packet_next_reloc [radeon]] *ERROR* No packet3 for relocation for packet at 14.
[ 1205.979427] [drm] ib[14]=0x0000138E
[ 1205.979433] [drm] ib[15]=0x00C00640
[ 1205.979437] [drm:r300_packet0_check [radeon]] *ERROR* No reloc for ib[13]=0x4E28
[ 1205.979545] [drm] ib[12]=0x0000138A
[ 1205.979548] [drm] ib[13]=0x00000000
[ 1205.979553] [drm:radeon_cs_ioctl [radeon]] *ERROR* Invalid command stream !
\end{verbatim}
|
|
|
|
This error message comes from
|
|
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r300.c#L664-L669}{drm/radeon/r300.c}.
|
|
|
|
The meaningless data following the type-3 NOP packet is used by the kernel to
|
|
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_cs.c#L875-L889}{index}
|
|
the \texttt{DRM\_RADEON\_CS} ``relocs'' array (an array of GEM buffer handles).
|
|
|
|
It seems perhaps the design goal was to never expose the VRAM address of GEM
|
|
buffers to userspace (indeed there seems to be no way to retrieve that via any
|
|
GEM ioctls). This restriction is slightly disappointing, as I would have
|
|
preferred to be able to send unmodified packet data to the R500.
|
|
|
|
However, at the moment this does not appear to be a significant issue, as a
relatively small number of registers are modified by the Linux kernel's packet
parser prior to creating the indirect buffer that is actually sent to the R500
hardware.
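
The relocation convention described above can be sketched as a helper that
appends the four-dword sequence to a packet buffer. This is a sketch under
assumptions: \texttt{emit\_reloc\_write} is a hypothetical name, and scaling the
relocation index by four dwords (\texttt{RELOC\_DWORDS}) is an assumption about
how the kernel indexes the relocs array that should be checked against
\texttt{radeon\_cs.c}:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define PACKET0(reg, n)   (((uint32_t)(n) << 16) | ((uint32_t)(reg) >> 2))
#define PACKET3(op, n)    (0xc0000000u | ((uint32_t)(n) << 16) | \
                           ((uint32_t)(op) << 8))
#define PACKET3_NOP       0x10   /* opcode field of the 0xc0001000 header */
#define RB3D_COLOROFFSET0 0x4e28 /* from the dmesg output above */
#define RELOC_DWORDS      4      /* assumed size of one relocs entry */

/* Append "write reg, then name the buffer it refers to" at index n of ib.
 * Returns the new write index. */
size_t emit_reloc_write(uint32_t *ib, size_t n, uint32_t reg,
                        uint32_t value, uint32_t reloc_index)
{
    ib[n++] = PACKET0(reg, 0);              /* one register write */
    ib[n++] = value;                        /* placeholder; patched by the kernel */
    ib[n++] = PACKET3(PACKET3_NOP, 0);      /* relocation marker */
    ib[n++] = reloc_index * RELOC_DWORDS;   /* selects the relocs entry */
    return n;
}
\end{verbatim}

With \texttt{reloc\_index} set to zero, this reproduces the four dwords that
Mesa was observed to emit for \texttt{RB3D\_COLOROFFSET0}.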
|
|
|
|
\subsection{Indirect buffers}
|
|
|
|
There appears to be a lot of memory-to-memory copying in the
|
|
Linux/Mesa/DRM/GEM/radeon graphics stack:
|
|
|
|
\begin{itemize}
|
|
\item Mesa writes the OpenGL state to various internal structures
|
|
\item Mesa \href{https://gitlab.freedesktop.org/mesa/mesa/-/blob/25.0/src/gallium/drivers/r300/r300_emit.c?ref_type=heads}{copies} OpenGL state to packet commands in a userspace buffer
|
|
\item Mesa
|
|
\href{https://gitlab.freedesktop.org/mesa/mesa/-/blob/25.0/src/gallium/winsys/radeon/drm/radeon_drm_cs.c?ref_type=heads#L486-487}{passes
|
|
the address} of the userspace buffer to the kernel via
|
|
\texttt{DRM\_RADEON\_CS}
|
|
\item Linux
|
|
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_cs.c#L340-L358}{copies
|
|
the entire userspace buffer} to kernel space (calling kvmalloc/kvfree on
|
|
each ioctl)
|
|
\item The \texttt{radeon\_cs\_parser} parses and modifies the buffer originally
|
|
generated by Mesa
|
|
\item \href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/radeon_cs.c#L613}{radeon\_cs\_ib\_fill} copies the parser result to gpu address space.
|
|
\end{itemize}
|
|
|
|
Eventually,
|
|
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r100.c#L3709-L3722}{r100\_ring\_ib\_execute}
|
|
is called, which writes the indirect buffer address (now in GPU address space)
|
|
to the ring.
|
|
|
|
It would be interesting to experiment with writing a packet buffer directly in
|
|
GPU/GTT address space (from Linux userspace), with zero copies. This would
|
|
require an entirely new set of ioctls.
|
|
|
|
\subsection{Triangle drawing attempt \#2}
|
|
|
|
These images were never drawn on-screen. I extracted them from VRAM via
|
|
\texttt{/sys/kernel/debug/radeon\_vram}.
|
|
|
|
\begin{figure}
|
|
\href{images/single_color_macrotiled.png}{\includegraphics{images/single_color_macrotiled.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{single\_color.c}}
|
|
\end{figure}
|
|
|
|
Though I was not aware of it yet, the above image was indeed my triangle, and
|
|
\texttt{COLORPITCH0} was merely in ``macrotiled'' mode. Once I realized this, I
|
|
produced this image (still in off-screen VRAM):
|
|
|
|
\begin{figure}
|
|
\href{images/single_color.png}{\includegraphics{images/single_color.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{single\_color.c}}
|
|
\end{figure}
|
|
|
|
This \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/single_color.c}{``single color''} demo uses deliberately simple vertex and fragment
shaders:
|
|
|
|
\begin{figure}
|
|
\begin{verbatim}
instruction[0]:
0x00f00203 dst: VE_ADD out[0].xyzw
0x00d10001 src0: input[0].xyzw
0x01248001 src1: input[0].0000
0x01248001 src2: input[0].0000
\end{verbatim}
|
|
\caption*{R500 vertex shader (1 instruction, 128-bit control word)}
|
|
\end{figure}
|
|
|
|
This vertex shader is doing the equivalent of:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/vertex_shader_equivalent_single_color.glsl}{\includegraphics{verbatim/output/vertex_shader_equivalent_single_color.glsl.pdf}}
|
|
\end{figure}
|
|
|
|
The W component \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae//drm/single_color.c#L339}{comes from}
|
|
\texttt{VAP\_PROG\_STREAM\_CNTL\_EXT\_\_SWIZZLE\_SELECT\_W\_0(5)}, which
|
|
swizzles W to a constant \texttt{1.0}, despite W not being present in the vertex
|
|
data.
|
|
|
|
\begin{figure}
|
|
\begin{verbatim}
instruction[0]:
0x00078005 OUT RGBA
0x08020080 RGB ADDR0=0.0 ADDR1=0.0 ADDR2=0.0
0x08020080 ALPHA ADDR0=0.0 ADDR1=0.0 ADDR2=0.0
0x1c9b04d8 RGB_SEL_A=src0.110 RGB_SEL_B=src0.110 TARGET=A
0x1c810003 ALPHA_OP=OP_MAX ALPHA_SEL_A=src0.0 ALPHA_SEL_B=src0.0 TARGET=A
0x00000005 RGB_OP=OP_MAX
\end{verbatim}
|
|
\caption*{R500 fragment shader (1 instruction, 192-bit control word)}
|
|
\end{figure}
|
|
|
|
This fragment shader is doing the equivalent of:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/fragment_shader_equivalent_single_color.glsl}{\includegraphics{verbatim/output/fragment_shader_equivalent_single_color.glsl.pdf}}
|
|
\end{figure}
|
|
|
|
via the src swizzles. I think it is interesting that there are so many options
|
|
for producing inline constants within the fragment shader.
|
|
|
|
The ``target'' fragment shader field also seems interesting. I am excited to
|
|
write shaders that use multiple output buffers.
|
|
|
|
\subsection{DRM/KMS/GBM}
|
|
|
|
These renders were not displayed on-screen, so I looked for ways to correct
|
|
this.
|
|
|
|
Perhaps the most obvious method would be to write to the display controller
|
|
registers (\texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS}) via
|
|
\texttt{DRM\_RADEON\_CS}. However, this does not work due to the command parser
|
|
anti-fun implemented in
|
|
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r300.c#L643}{r300\_packet0\_check}:
|
|
any register not present in that case statement is considered invalid, and the
|
|
packet buffer is not submitted.
|
|
|
|
I attempted to do this the ``right way'' via the DRM/KMS/GBM APIs. I then
|
|
learned that this does not behave correctly on my R500 because demos that wait
|
|
for the flag returned by \texttt{DRM\_IOCTL\_MODE\_PAGE\_FLIP} hang forever.
|
|
|
|
I noticed this earlier on Xorg/GLX as well, as I have been using the
|
|
\texttt{vblank\_mode=0} environment variable to avoid hanging forever in
|
|
\texttt{glXSwapBuffers}. This appears to be a Linux kernel bug, but I didn't
|
|
investigate this further.
|
|
|
|
\subsection{On-screen drawing}
|
|
|
|
I noticed in \texttt{/sys/kernel/debug/radeon\_vram\_mm} that the Linux console
|
|
is only using a single framebuffer (and does not double-buffer).
|
|
|
|
This is fortunate, because this means I can simply
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/pci_user/main.c#L48}{mmap
|
|
the register address space} and write
|
|
\texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} myself without worrying about the
|
|
Linux console overwriting my change. I observed the \texttt{0x813000} value from
|
|
\texttt{/sys/kernel/debug/radeon\_vram\_mm}--there appears to be no other way to
|
|
get the vram address of a GEM buffer.
|
|
|
|
This is ``good enough'' for now, though at some point I'll want to learn how to
|
|
do proper vblank-synchronized double buffering.
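
The register write described above can be sketched in userspace C. This is a
minimal sketch under assumptions: the register BAR is taken to be exposed as a
\texttt{resource2} file under \texttt{/sys/bus/pci/devices/},
\texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} is taken to be at register offset
\texttt{0x6110}, and \texttt{map\_registers} and \texttt{set\_scanout\_address}
are hypothetical names that are not taken from \texttt{pci\_user/main.c}:

\begin{verbatim}
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define D1GRPH_PRIMARY_SURFACE_ADDRESS 0x6110  /* assumed register offset */

/* Map `size` bytes of a sysfs PCI resource file; returns NULL on failure. */
volatile uint32_t *map_registers(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Point the display controller at a different framebuffer address. */
void set_scanout_address(volatile uint32_t *regs, uint32_t address)
{
    regs[D1GRPH_PRIMARY_SURFACE_ADDRESS / 4] = address;
}
\end{verbatim}

A caller would pass a device path such as
\texttt{/sys/bus/pci/devices/0000:01:00.0/resource2} (the PCI address here is a
placeholder) and then call \texttt{set\_scanout\_address} with the address
observed in \texttt{radeon\_vram\_mm}.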
|
|
|
|
\subsection{Triangle drawing attempt \#3}
|
|
|
|
I felt the next logical step was to learn how attributes and constants are
|
|
passed through the shader pipeline, so I then \href{https://git.idk.st/bilbo/r500/src/commit/95e9ba85ae/drm/vertex_color.c}{created a demo} that produced this image (this time also displayed on-screen):
|
|
|
|
\begin{figure}
|
|
\href{images/vertex_color.png}{\includegraphics{images/vertex_color.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{vertex\_color.c}}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\begin{verbatim}
instruction[0]:
0x00702203 dst: VE_ADD out[1].xyz_
0x01d10021 src0: input[1].xyz_
0x01248021 src1: input[1].0000
0x01248021 src2: input[1].0000
instruction[1]:
0x00f00203 dst: VE_ADD out[0].xyzw
0x01510001 src0: input[0].xyz1
0x01248001 src1: input[0].0000
0x01248001 src2: input[0].0000
\end{verbatim}
|
|
\caption*{R500 vertex shader (2 instructions, 128-bit control words)}
|
|
\end{figure}
|
|
|
|
This vertex shader is doing the equivalent of:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/vertex_shader_equivalent_vertex_color.glsl}{\includegraphics{verbatim/output/vertex_shader_equivalent_vertex_color.glsl.pdf}}
|
|
\end{figure}
|
|
|
|
The extra vertex input is fed to the vertex shader via changes to
|
|
\texttt{VAP\_PROG\_STREAM\_CNTL\_0},
|
|
\texttt{VAP\_PROG\_STREAM\_CNTL\_EXT\_0}. Based on my currently limited
|
|
understanding, it seems that arranging the vertex data like this:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/vap_prog_stream_vertices.c}{\includegraphics{verbatim/output/vap_prog_stream_vertices.c.pdf}}
|
|
\end{figure}
|
|
|
|
is easier to deal with in \texttt{VAP\_PROG\_STREAM\_CNTL} than:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/vap_prog_stream_vertices2.c}{\includegraphics{verbatim/output/vap_prog_stream_vertices2.c.pdf}}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\begin{verbatim}
instruction[0]:
0x00078005 OUT RGBA
0x08020000 RGB ADDR0=temp[0] ADDR1=0.0 ADDR2=0.0
0x08020080 ALPHA ADDR0=0.0 ADDR1=0.0 ADDR2=0.0
0x1c440220 RGB_SEL_A=src0.rgb RGB_SEL_B=src0.rgb TARGET=A
0x1cc18003 ALPHA_OP=OP_MAX ALPHA_SEL_A=src0.1 ALPHA_SEL_B=src0.1 TARGET=A
0x00000005 RGB_OP=OP_MAX
\end{verbatim}
|
|
\caption*{R500 fragment shader (1 instruction, 192-bit control word)}
|
|
\end{figure}
|
|
|
|
This fragment shader is doing the equivalent of:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/fragment_shader_equivalent_vertex_color.glsl}{\includegraphics{verbatim/output/fragment_shader_equivalent_vertex_color.glsl.pdf}}
|
|
\end{figure}
|
|
|
|
The \texttt{temp} input appears to be written by
\texttt{VAP\_OUT\_VTX\_FMT\_0\_\_VTX\_COLOR\_0\_PRESENT} and read due to the
changes to \texttt{RS\_COUNT} and \texttt{RS\_INST\_0}.
|
|
|
|
\section{Progress: 21 Oct 2025}
|
|
|
|
From 15 Oct 2025 to 21 Oct 2025, I achieved the following (roughly in chronological order):
|
|
|
|
\begin{itemize}
|
|
\item I learned how the vertex fetcher is \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/vertex_color_aos.c#L387-L401}{configured}
|
|
\item I learned how the ``point list'' drawing primitive can be used to \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear.c#L504}{clear the screen}
|
|
\item I invented a new syntax for R500 vertex shader assembly (ATI never specified one themselves)
|
|
\item I modified my R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/pvs_disassemble.py}{vertex shader disassembler} to emit this new vertex shader syntax
|
|
\item I wrote an R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/assembler/vs}{vertex shader assembler} that can process my vertex shader assembly syntax
|
|
\item I created several animated demos with \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_translate_vblank.c#L849-L859}{vblank-synchronized double buffering}
|
|
\item I learned how to configure and draw (multi-)textured triangles
|
|
\item I learned how to configure, clear, and use Z-buffers
|
|
\item I made a \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/texture_cube_clear_zwrite_vertex_shader.c}{textured rotating cube demo} that uses my first non-trivial \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/cube_rotate.vs.asm}{handwritten vertex shader assembly program}
|
|
\item I invented a new syntax for R500 fragment shader assembly (ATI never specified one themselves)
|
|
\item I wrote a new R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/us_disassemble2.py}{fragment shader disassembler} that emits this new fragment shader syntax
|
|
\item I wrote an R500 \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/assembler/fs}{fragment shader assembler} that can process my fragment shader assembly syntax
|
|
\item I wrote a ``shadertoy''-style demo that uses my first non-trivial \href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/shadertoy_palette.fs.asm}{handwritten fragment shader assembly program}
|
|
\end{itemize}
|
|
|
|
\subsection{DRM\_RADEON\_CS state tracking}
|
|
|
|
While attempting to refactor one of my R500 demos to send fewer registers per
\texttt{DRM\_RADEON\_CS} ioctl, I found that there is a ``state tracker'' within
\texttt{drm/radeon/r100.c}. Even if you don't use or depend on a Z-buffer,
\texttt{DRM\_RADEON\_CS} will still reject your packet buffer depending on its
own (imagined) concept of what the GPU state is. For example:
|
|
|
|
\begin{verbatim}
[ 1614.729278] [drm:r100_cs_track_check [radeon]] *ERROR* [drm] No buffer for z buffer !
[ 1614.729626] [drm:radeon_cs_ioctl [radeon]] *ERROR* Invalid command stream !
\end{verbatim}
|
|
|
|
This happens because \texttt{track->z\_enabled} is
|
|
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r100.c#L2435}{initially
|
|
true} at the start of a \texttt{DRM\_RADEON\_CS} ioctl, and does not become
|
|
false unless the packet buffer
|
|
\href{https://github.com/torvalds/linux/blob/v6.12/drivers/gpu/drm/radeon/r300.c#L836-L843}{contains
|
|
a write} to \texttt{ZB\_CNTL}.
|
|
|
|
This seems a bit heavy-handed. Even if the model were ``multiple applications
|
|
may be using the GPU, so a single application can't depend on previously-set
|
|
register state'', it would still be better if the kernel didn't try to enforce
|
|
this by restricting permissible content of a packet buffer.
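
One way to satisfy the tracker when no Z-buffer is wanted is to include an
explicit \texttt{ZB\_CNTL} write in every packet buffer. The following is a
minimal sketch, assuming \texttt{ZB\_CNTL} is at register offset
\texttt{0x4f00} and that a value of zero clears the Z-enable bit that the
tracker inspects. \texttt{emit\_z\_disable} is a hypothetical helper name:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define PACKET0(reg, n) (((uint32_t)(n) << 16) | ((uint32_t)(reg) >> 2))
#define ZB_CNTL         0x4f00  /* assumed register offset */

/* Append a ZB_CNTL = 0 write at index n of ib; returns the new index. */
size_t emit_z_disable(uint32_t *ib, size_t n)
{
    ib[n++] = PACKET0(ZB_CNTL, 0);  /* type 0 packet, one register */
    ib[n++] = 0;                    /* all enable bits clear */
    return n;
}
\end{verbatim}

This costs two extra dwords per submission, which is small compared to the rest
of a typical packet buffer.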
|
|
|
|
\subsection{Vertex transform bypass}
|
|
|
|
Mesa uses a ``point'' 3D primitive to implement \texttt{glClear} on R500. It
|
|
does this by first uploading this vertex shader:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/mesa_glclear.vs.asm}{\includegraphics{verbatim/output/mesa_glclear.vs.asm.pdf}}
|
|
\caption*{\texttt{mesa\_glclear.vs.asm}}
|
|
\end{figure}
|
|
|
|
This shader does nothing to the input other than copy it to the output, where
|
|
\texttt{out[0]} is the position vector, and \texttt{out[1]} is sent to the
|
|
fragment shader as a ``texture coordinate''. That fragment shader, in turn, does
|
|
not use the texture coordinate:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/mesa_glclear.fs.asm}{\includegraphics{verbatim/output/mesa_glclear.fs.asm.pdf}}
|
|
\caption*{\texttt{mesa\_glclear.fs.asm}}
|
|
\end{figure}
|
|
|
|
In my ``clear''
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_rotate_vblank.c#L539}{implementation},
|
|
I instead set \texttt{PVS\_BYPASS}, which ``bypasses'' the vertex shader
completely, sending the vertices directly to the rasterizer. This is convenient
because it obviates the need to upload/change vertex shaders just to clear the
color and Z-buffers.
|
|
|
|
\subsection{Animation attempt \#1}
|
|
|
|
With a working colorbuffer clear, I wrote the
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_translate.c#L786}{single\_color\_clear\_translate.c}
|
|
demo to translate my triangle position coordinates in a loop that waits for
|
|
\texttt{DRM\_RADEON\_GEM\_WAIT\_IDLE} between each frame. This attempt
|
|
produced the following images:
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/single_color_clear_translate.png}
|
|
\caption*{R500 DVI capture, \texttt{single\_color\_clear\_translate.c}}
|
|
\end{figure}
|
|
|
|
This was intended to be a smooth animation, yet it is not. It also seems several
|
|
frames are never being displayed--the translation step is much smaller than what
|
|
is shown in the video.
|
|
|
|
Interestingly, this is identical to how OpenGL/GLX applications behave on R500
with \texttt{vblank\_mode=0}.
|
|
|
|
\subsection{Animation attempt \#2}
|
|
|
|
I read the R500 display controller \href{doc/RRG-216M56-03oOEM.pdf}{register reference guide} again.
|
|
It appears to suggest the \texttt{D1CRTC\_UPDATE\_INSTANTLY} bit, when unset,
|
|
might cause changes to \texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} to be delayed in
|
|
hardware until the next vertical blanking interval begins.
|
|
|
|
This can be combined with polling \texttt{D1GRPH\_SURFACE\_UPDATE\_PENDING} to
later determine when the vblank-synchronized frame change actually occurred.
|
|
|
|
This is precisely what I implemented in
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/single_color_clear_translate_vblank.c#L854-L855}{single\_color\_clear\_translate\_vblank.c}:
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/single_color_clear_translate_vblank.png}
|
|
\caption*{R500 DVI capture, \texttt{single\_color\_clear\_translate\_vblank.c}}
|
|
\end{figure}
|
|
|
|
This is much closer to what I intended. The
|
|
\texttt{D1GRPH\_SURFACE\_UPDATE\_PENDING} part is certainly working as I
|
|
expected. Setting/unsetting \texttt{D1CRTC\_UPDATE\_INSTANTLY} appears to have
|
|
no effect on \texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} behavior, so I feel my
|
|
understanding of R500 double-buffering is still incomplete.
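
The write-then-poll sequence described above can be sketched as follows. This is
a minimal sketch under assumptions: \texttt{D1GRPH\_SURFACE\_UPDATE\_PENDING} is
taken to be bit 2 of a \texttt{D1GRPH\_UPDATE} register at offset
\texttt{0x6144}, \texttt{D1GRPH\_PRIMARY\_SURFACE\_ADDRESS} is taken to be at
offset \texttt{0x6110}, and \texttt{flip\_and\_wait} is a hypothetical name that
does not appear in \texttt{single\_color\_clear\_translate\_vblank.c}:

\begin{verbatim}
#include <stdint.h>

#define D1GRPH_PRIMARY_SURFACE_ADDRESS 0x6110     /* assumed offset */
#define D1GRPH_UPDATE                  0x6144     /* assumed offset */
#define D1GRPH_SURFACE_UPDATE_PENDING  (1u << 2)  /* assumed bit position */

/* Request a new scanout address, then poll until the hardware reports that
 * the update is no longer pending. Returns 0 on success, or -1 if the bit
 * is still set after max_polls reads. */
int flip_and_wait(volatile uint32_t *regs, uint32_t address,
                  unsigned long max_polls)
{
    regs[D1GRPH_PRIMARY_SURFACE_ADDRESS / 4] = address;

    for (unsigned long i = 0; i < max_polls; i++) {
        uint32_t update = regs[D1GRPH_UPDATE / 4];
        if ((update & D1GRPH_SURFACE_UPDATE_PENDING) == 0)
            return 0;
    }
    return -1;
}
\end{verbatim}

The poll limit is there so that a wrong register offset or bit position shows
up as an error return rather than as a hang.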
|
|
|
|
\subsection{Multiple-texture sampling}
|
|
|
|
I am amazed and delighted at how simple multiple-texture sampling is on R500.
|
|
|
|
As a counter-example, while Sega Dreamcast does have a fairly capable
|
|
fixed-function blending unit, to use the blending unit with multiple-texture
|
|
sampled polygons one needs to render the polygon multiple times (at least once
|
|
per texture) to an accumulation buffer. Blending is then performed between the
|
|
currently-sampled texture and the previously-accumulated result, and the blend
|
|
result is written to the accumulation buffer. From a vertex transformation
|
|
perspective, it can be inconvenient/inefficient to be required to buffer entire
|
|
triangle strips so that they can be submitted more than once per frame without
|
|
duplicating the clip/transform computations.
|
|
|
|
This is the fragment shader for
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/drm/texture_dual.c}{texture\_dual.c}
|
|
(disassembly of code originally generated by Mesa):
|
|
|
|
\begin{figure}
|
|
\href{verbatim/texture_dual.fs.asm}{\includegraphics{verbatim/output/texture_dual.fs.asm.pdf}}
|
|
\caption*{\texttt{texture\_dual.fs.asm}}
|
|
\end{figure}
|
|
|
|
This pre-subtract multiply-add is an algebraic rearrangement of this GLSL code:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/texture_dual.fs.glsl}{\includegraphics{verbatim/output/texture_dual.fs.glsl.pdf}}
|
|
\caption*{\texttt{texture\_dual.fs.glsl}}
|
|
\end{figure}
|
|
|
|
Which produces this image:
|
|
|
|
\begin{figure}
|
|
\href{images/texture_dual.png}{\includegraphics{images/texture_dual.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{texture\_dual.c}}
|
|
\end{figure}
|
|
|
|
Being able to manipulate the texture samples as fragment shader unit temporaries
|
|
rather than as a sequence of accumulation buffer operations has me feeling excited
|
|
to do more with this.
|
|
|
|
\subsection{Z-buffer clear}
|
|
|
|
I've never worked with traditional Z-buffers before--Sega Saturn uses
|
|
\href{https://en.wikipedia.org/wiki/Painter\%27s_algorithm}{painter's algorithm}
|
|
exclusively, and Sega Dreamcast uses a ``depth accumulation buffer''
|
|
that isn't directly readable/writable.
|
|
|
|
It is slightly obvious in retrospect, but it took me several minutes to realize
that a ``depth clear'' can be implemented by covering the entire screen with a
``point'' primitive with the desired initial depth while \texttt{ZFUNC} is set to
\texttt{ALWAYS}.
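
The depth state for such a clear can be sketched as two register writes. The
register offsets and bit positions below are assumptions that should be checked
against the documentation: \texttt{ZB\_CNTL} at \texttt{0x4f00} with Z-enable in
bit 1 and Z-write-enable in bit 2, and \texttt{ZB\_ZSTENCILCNTL} at
\texttt{0x4f04} with \texttt{ZFUNC} in the low bits and \texttt{ALWAYS} encoded
as 7. \texttt{emit\_depth\_clear\_state} is a hypothetical helper name:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define PACKET0(reg, n)  (((uint32_t)(n) << 16) | ((uint32_t)(reg) >> 2))
#define ZB_CNTL          0x4f00     /* assumed register offset */
#define ZB_ZSTENCILCNTL  0x4f04     /* assumed register offset */
#define Z_ENABLE         (1u << 1)  /* assumed bit position */
#define ZWRITE_ENABLE    (1u << 2)  /* assumed bit position */
#define ZFUNC_ALWAYS     7u         /* assumed encoding of ALWAYS */

/* Append depth state so that every covered pixel passes the depth test and
 * has its depth overwritten. Returns the new write index. */
size_t emit_depth_clear_state(uint32_t *ib, size_t n)
{
    ib[n++] = PACKET0(ZB_CNTL, 0);
    ib[n++] = Z_ENABLE | ZWRITE_ENABLE;  /* test and write both enabled */
    ib[n++] = PACKET0(ZB_ZSTENCILCNTL, 0);
    ib[n++] = ZFUNC_ALWAYS;              /* depth test always passes */
    return n;
}
\end{verbatim}

The full-screen ``point'' primitive that follows this state then writes its own
depth value to every pixel it covers.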
|
|
|
|
\subsection{Drawing a 3D cube}
|
|
|
|
With working double-buffering, Z-buffering, and the ability to clear each of
|
|
these every frame, I felt I was finally ready to draw something ``3D''.
|
|
|
|
I thought it would be fun to first start with a cube that is transformed in
|
|
``software'' on the x86 CPU (not using a vertex shader). This sequence of videos
|
|
shows my progression on implementing this:
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/texture_cube.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube.c}}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/texture_cube_clear.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_clear.c}}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/texture_cube_clear_zwrite.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite.c}}
|
|
\end{figure}
|
|
|
|
\subsection{Drawing a 3D cube with vertex shaders}
|
|
|
|
I then decided it would be fun to hand-write a ``3D rotation'' vertex shader
|
|
from scratch. I first implemented the rotation in GLSL:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate.vs.glsl}{\includegraphics{verbatim/output/cube_rotate.vs.glsl.pdf}}
|
|
\caption*{\texttt{cube\_rotate.vs.glsl}}
|
|
\end{figure}
|
|
|
|
\subsubsection{Remapping shader unit sin/cos operands}
|
|
|
|
Because this shader program depends on being able to calculate sin and cos, this
|
|
meant I immediately needed to understand how to use the \texttt{ME\_SIN} and
|
|
\texttt{ME\_COS} operations.
|
|
|
|
The R500 vertex shader ME unit clamps sin/cos operands to the range
|
|
$(-\pi,+\pi)$, as in:
|
|
|
|
\begin{figure}
|
|
\href{diagrams/sin_clamp.pdf}{\includegraphics{diagrams/sin_clamp.pdf}}
|
|
\end{figure}
|
|
|
|
``Remapping'' floating point values from $(-\infty,+\infty)$ to $(-\pi,+\pi)$ is not
|
|
obvious. I was not previously aware of this transformation:
|
|
|
|
\begin{figure}
|
|
\href{diagrams/sin_frac.pdf}{\includegraphics{diagrams/sin_frac.pdf}}
|
|
\end{figure}
|
|
|
|
Or, expressed as R500 vertex shader assembly:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/sin_operand_remap.vs.asm}{\includegraphics{verbatim/output/sin_operand_remap.vs.asm.pdf}}
|
|
\end{figure}
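
A minimal Python sketch of one way to perform this kind of remap (an
illustration only: the function name is hypothetical, and this is not
necessarily the exact formulation used in the diagram or the assembly above).
It converts the angle to ``turns'', keeps only the fractional part, and scales
back, so the result lies in $[-\pi,+\pi)$ while the sine and cosine of the
result are unchanged:

\begin{minted}{python}
import math

def remap_sin_operand(x: float) -> float:
    """Wrap x into [-pi, +pi) without changing sin(x) or cos(x)."""
    # Express the angle in turns, shifted by half a turn so that the
    # final subtraction of pi centres the output range on zero.
    turns = x / (2.0 * math.pi) + 0.5
    # Floor-based fractional part: always in [0, 1), also for negative x.
    frac = turns - math.floor(turns)
    return frac * 2.0 * math.pi - math.pi
\end{minted}

For example, \texttt{remap\_sin\_operand(7.0)} returns a value close to
$7 - 2\pi$, which has the same sine and cosine as $7$.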
|
|
|
|
\subsubsection{Translation of the GLSL vertex shader to R500 vertex shader assembly}
|
|
|
|
Having verified that the GLSL version works as expected in OpenGL, and knowing
|
|
how to use the R500 vertex shader sin/cos operations, I then translated the GLSL
|
|
to R500 vertex shader assembly, as:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate.vs.asm}{\includegraphics{verbatim/output/cube_rotate.vs.asm.pdf}}
|
|
\caption*{\texttt{cube\_rotate.vs.asm}}
|
|
\end{figure}
|
|
|
|
\subsubsection{Vertex shader assembler/code generator debugging}
|
|
|
|
However, when I first executed the vertex shader cube rotation demo, I found
|
|
it did not work as expected:
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/texture_cube_clear_zwrite_vertex_shader_incorrect.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader.c}\\(incorrect vertex shader assembler output)}
|
|
\end{figure}
|
|
|
|
After hours of debugging, I eventually found the issue was in this instruction:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_3_temp.vs.asm}{\includegraphics{verbatim/output/cube_rotate_3_temp.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf} briefly mentions this on pages 98 and 99:
|
|
|
|
\begin{quote}
|
|
The PVS\_DST\_MACRO\_INST bit was meant to be used for MACROS such as a
|
|
vector-matrix multiply, but currently is only set for the following cases:
|
|
|
|
A VE\_MULTIPLY\_ADD or VE\_MULTIPLYX2\_ADD instruction with all 3 source
|
|
operands using unique PVS\_REG\_TEMPORARY vector addresses. Since R300 only has
|
|
two read ports on the temporary memory, this special case of these instructions
|
|
is broken up (by the HW) into 2 operations.
|
|
\end{quote}
|
|
|
|
I read this paragraph much earlier, but I didn't fully understand it until
|
|
now. Indeed, this multiply-add has three unique \texttt{temp} addresses, and
|
|
must be encoded as a ``macro'' instruction.
|
|
|
|
I fixed this in my vertex shader assembler by
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/91f83bdaa8/regs/assembler/vs/validator.py}{counting the number of unique temp addresses}
|
|
referenced by each instruction, promoting \texttt{VE\_MULTIPLY\_ADD} to
|
|
\texttt{PVS\_MACRO\_OP\_2CLK\_MADD} if more than two unique \texttt{temp}
|
|
addresses are referenced.
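
The promotion rule can be sketched in Python roughly as follows. This is a
simplified illustration rather than the actual \texttt{validator.py} code: the
\texttt{Source} type and both function names are hypothetical, and only the
\texttt{VE\_MULTIPLY\_ADD} case described above is handled.

\begin{minted}{python}
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    kind: str      # hypothetical register file name: "temp", "const", ...
    address: int   # vector address within that register file

def unique_temp_addresses(sources):
    """Return the set of distinct temp addresses read by an instruction."""
    return {s.address for s in sources if s.kind == "temp"}

def select_opcode(opcode, sources):
    """Promote a multiply-add to the 2-clock macro form when it reads
    more than two distinct temp addresses (two temp read ports)."""
    if opcode == "VE_MULTIPLY_ADD" and len(unique_temp_addresses(sources)) > 2:
        return "PVS_MACRO_OP_2CLK_MADD"
    return opcode
\end{minted}

Counting distinct addresses (rather than operands) matters here: a multiply-add
that reads the same \texttt{temp} address twice still fits in two read ports
and is left unchanged.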
|
|
|
|
With this change, reassembling the same vertex shader source code now produces a
|
|
correct vertex shader cube rotation:
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/texture_cube_clear_zwrite_vertex_shader.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader.c}\\(correct vertex shader assembler output)}
|
|
\end{figure}
|
|
|
|
\subsection{Comparison with Mesa's R500 vertex shader compiler}
|
|
|
|
My ``cube rotation'' vertex shader,
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/50244c7c95/drm/cube_rotate.vs.asm}{cube\_rotate.vs.asm}
|
|
is 15 instructions.
|
|
|
|
Mesa's R500 vertex shader compiler generated a
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/50244c7c95/shader_examples/mesa/texture_cube_depth_vertex_shader.vs.txt}{27-instruction vertex shader}
|
|
from \href{https://r500.idk.st/verbatim/cube_rotate.vs.glsl}{semantically equivalent GLSL code}. Disassembly:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/mesa_cube_rotate.vs.asm}{\includegraphics{verbatim/output/mesa_cube_rotate.vs.asm.pdf}}
|
|
\caption*{\texttt{mesa\_cube\_rotate.vs.asm}}
|
|
\end{figure}
|
|
|
|
I was not particularly trying to write concise code, but I find this difference
|
|
in instruction count to be surprising. In general it seems Mesa's R500 vertex
|
|
shader compiler fails to vectorize several operations, and performs significantly
|
|
more scalar multiplies and scalar multiply-adds than my implementation.
|
|
|
|
Ignoring algorithmic improvements (such as lifting the sin/cos calculation to
|
|
x86 code and instead sending a 4x4 matrix to the vertex shader), there is still
|
|
more opportunity for optimization beyond my 15-instruction implementation.
|
|
|
|
Particularly, the vertex shader unit has a ``dual math'' instruction mode, where
|
|
``vector engine'' (VE\_) and ``math engine'' (ME\_) operations can be executed
|
|
simultaneously in the same instruction. \texttt{cube\_rotate.vs.asm} would
|
|
indeed benefit from such an optimization--most of the \texttt{ME\_SIN} and
|
|
\texttt{ME\_COS} instructions could be interleaved with the \texttt{VE\_MUL} and
|
|
\texttt{VE\_MAD} operations that follow (at significant expense to
|
|
human-readability).
|
|
|
|
I am curious to see more examples of the difference between Mesa's R500 vertex
|
|
shader compiler output and my own vertex shader assembly.
|
|
|
|
\subsection{Fragment shader instruction expressiveness}
|
|
|
|
Compared to the R500 vertex shader instructions, the R500 fragment shader
|
|
instructions are significantly more featureful. This makes it harder to invent
a syntax that can fully express the range of operations that an R500 fragment
shader instruction can perform.
|
|
|
|
A significant difference is where R500 vertex shaders have a single tier of
|
|
operand argument decoding, as in:
|
|
|
|
\begin{figure}
|
|
\includegraphics{diagrams/vertex_inputs.svg}
|
|
\caption*{R500 vertex shader instruction operand inputs (simplified)}
|
|
\end{figure}
|
|
|
|
While R500 fragment shaders have multiple tiers of operand argument decoding, as
|
|
in:
|
|
|
|
\begin{figure}
|
|
\includegraphics{diagrams/fragment_inputs.svg}
|
|
\caption*{R500 fragment shader instruction operand inputs (simplified)}
|
|
\end{figure}
|
|
|
|
I've written several \href{https://github.com/buhman/scu-dsp-asm}{nice assemblers}
|
|
for other architectures in the past, but I've never seen any instruction set
|
|
as expressive as R500 fragment shaders.
|
|
|
|
I attempted to directly represent this ``multiple tiers of operand argument
|
|
decoding'' in my fragment shader ALU instruction syntax.
|
|
|
|
These instructions are also vector instructions: a total of 24 floating point
|
|
input operands and 8 floating point results could be evaluated per instruction.
|
|
|
|
With this abundance of expressiveness and a relatively high skill ceiling, I'm
|
|
amazed R500 fragment shader assembly isn't more popular in programming
|
|
competitions, general everyday conversation, etc...
|
|
|
|
\subsection{Fragment shader assembler bugs}
|
|
|
|
There were two ``I spent a lot of time debugging this'' issues I encountered
|
|
with my fragment shader assembler.
|
|
|
|
The first was in this code I wrote to draw a fragment shaded circle, as in:
|
|
|
|
\begin{figure}
|
|
\href{images/shadertoy_circle.png}{\includegraphics{images/shadertoy_circle.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{shadertoy\_circle.fs.asm}}
|
|
\end{figure}
|
|
|
|
However, in an earlier version of my fragment shader assembler, I produced this
|
|
image instead:
|
|
|
|
\begin{figure}
|
|
\href{images/shadertoy_circle_incorrect.png}{\includegraphics{images/shadertoy_circle_incorrect.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{shadertoy\_circle.fs.asm}\\(incorrect assembler output)}
|
|
\end{figure}
|
|
|
|
In this handwritten fragment shader code:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/shadertoy_circle.fs.asm}{\includegraphics{verbatim/output/shadertoy_circle.fs.asm.pdf}}
|
|
\caption*{\texttt{shadertoy\_circle.fs.asm}}
|
|
\end{figure}
|
|
|
|
\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf} says briefly on page 241:
|
|
|
|
\begin{quote}
|
|
Specifies whether to insert a NOP instruction after this. This would get
|
|
specified in order to meet dependency requirements for the pre-subtract inputs,
|
|
and dependency requirements for src0 of an MDH/MDV instruction.
|
|
\end{quote}
|
|
|
|
The issue is the pre-subtract input for the \texttt{MAD |srcp.a| src0.1 -src2.a}
|
|
instruction depends on the write to \texttt{temp[0].a} from the immediately
|
|
preceding \texttt{RCP src0.a} instruction--a pipeline hazard.
|
|
|
|
To fix this, I added support for
|
|
\href{https://git.idk.st/bilbo/r500/commit/fe0684ca5e58ed3be026410812c042e883bdce71}{generating the \texttt{NOP} bit}
|
|
in my fragment shader assembler.
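
A minimal sketch of how such a hazard check might look, assuming a deliberately
simplified instruction model: the \texttt{AluInstruction} type, its fields, and
the function name are all hypothetical, and the MDH/MDV \texttt{src0} case from
the manual is not modeled. Following the quoted wording (``insert a NOP
instruction after this''), the bit is set on the instruction that produces the
value:

\begin{minted}{python}
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Register = Tuple[str, int, str]   # e.g. ("temp", 0, "a")

@dataclass
class AluInstruction:
    dest: Optional[Register] = None
    # Registers read through the pre-subtract unit by this instruction.
    presub_sources: List[Register] = field(default_factory=list)
    nop: bool = False

def mark_presubtract_hazards(program):
    """Set the NOP bit on any instruction whose destination is read as a
    pre-subtract input by the immediately following instruction."""
    for producer, consumer in zip(program, program[1:]):
        if producer.dest is not None and producer.dest in consumer.presub_sources:
            producer.nop = True
    return program
\end{minted}

Modeled this way, the \texttt{RCP} that writes \texttt{temp[0].a} would receive
the NOP bit, because the following \texttt{MAD} reads \texttt{temp[0].a}
through its pre-subtract input.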
|
|
|
|
\subsection{More fragment shader assembler bugs}
|
|
|
|
While trying to produce this image:
|
|
|
|
\begin{figure}
|
|
\href{images/shadertoy_palette.png}{\includegraphics{images/shadertoy_palette.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{shadertoy\_palette.fs.asm}}
|
|
\end{figure}
|
|
|
|
My fragment shader code instead produced this image:
|
|
|
|
\begin{figure}
|
|
\href{images/shadertoy_palette_incorrect.png}{\includegraphics{images/shadertoy_palette_incorrect.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{shadertoy\_palette.fs.asm}\\(incorrect assembler output)}
|
|
\end{figure}
|
|
|
|
The issue was simply that in the chaos of all of the other features I was
|
|
implementing for my fragment shader assembler, I
|
|
\href{https://git.idk.st/bilbo/r500/commit/f6a0fc4fab5dee3085dcf4b9a984244bba05d5ca}{forgot to emit the \texttt{ADDRD} bits}.
|
|
|
|
This meant that while fragment shader code that exclusively uses zero-address
|
|
destinations, such as \texttt{shadertoy\_circle.fs.asm}, appeared to work
|
|
completely correctly, I encountered this bug as soon as I started using non-zero
|
|
addresses such as \texttt{temp[1]} in my fragment shader code.
|
|
|
|
\subsection{Comparison to Direct3D ``asm''}
|
|
|
|
Prior to Direct3D 10, Microsoft defined a specification for both
|
|
\href{https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx9-graphics-reference-asm-vs-3-0}{vertex shader assembly} and
|
|
\href{https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx9-graphics-reference-asm-ps-3-0}{fragment shader assembly}.
|
|
|
|
The Direct3D ``asm'' name is slightly deceptive, however, as the
|
|
\texttt{vs\_3\_0} and \texttt{ps\_3\_0} instruction syntax does not map 1-to-1
|
|
with any hardware that exists.
|
|
|
|
It would perhaps be more accurate to think of Direct3D's ``asm''
|
|
language and compiler as more analogous to a
|
|
\href{https://en.wikipedia.org/wiki/BASIC}{shader BASIC} than as a true assembly
|
|
language on the same level as ``6502 assembly'', ``Z80 assembly'' and similar.
|
|
|
|
In contrast, my R500 assembly syntaxes are deliberately/explicitly mapped 1-to-1
|
|
with R500 instructions.
|
|
|
|
\subsection{Fragment shader animated demo}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/shadertoy_palette.png}
|
|
\caption*{R500 DVI capture, \texttt{shadertoy\_palette.fs.asm}}
|
|
\end{figure}
|
|
|
|
The R500 fragment shader code that I handwrote for this is:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/shadertoy_palette.fs.asm}{\includegraphics{verbatim/output/shadertoy_palette.fs.asm.pdf}}
|
|
\caption*{\texttt{shadertoy\_palette.fs.asm}}
|
|
\end{figure}
|
|
|
|
The \texttt{float} constants are interesting--they are decoded almost
|
|
identically to the
|
|
\href{https://en.wikipedia.org/wiki/Minifloat#8-bit_(1.4.3)}{8-bit (1.4.3) (bias 7) format shown on Wikipedia},
|
|
except:
|
|
\begin{itemize}
|
|
\item There is no sign bit (the value is always positive--positive values
|
|
can be swizzled to produce negative operands)
|
|
\item There is no ``zero'' value (zero can also be instead obtained via
|
|
swizzles); the ``all zeros'' bit pattern instead has a value of
|
|
\texttt{0.0009765625}.
|
|
\item There are no infinite or not-a-number values: a ``15'' exponent is treated
|
|
as 15.
|
|
\end{itemize}
|
|
|
|
The exponent/mantissa table that shows example 7-bit float values on page 106 of
|
|
\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf} is incorrect.
|
|
|
|
\section{Progress: 26 Oct 2025}
|
|
|
|
From 21 Oct 2025 to 26 Oct 2025, I achieved the following (roughly in chronological order):
|
|
|
|
\begin{itemize}
|
|
\item I \href{https://git.idk.st/bilbo/r500/commit/8594bc4a38f6fcab2ac6e437b46bcf1e0e6d32dd}{rewrote} most of the vertex shader assembler parser/validator, and implemented support for \href{https://git.idk.st/bilbo/r500/commit/f3f1969f4a9b336536f5fb23d246f7103c41e20d}{assembling/disassembling ``dual math'' operations}
|
|
\item I implemented support for \href{https://git.idk.st/bilbo/r500/commit/96d7286e7cd3270b9dca0924d3a046d585d6dc9d}{assembling} and \href{https://git.idk.st/bilbo/r500/commit/27227426eaac265bc3126edd7d017c791640e789}{disassembling} TEX fragment shader instructions
|
|
\item I presented this project (including live demos on real hardware) at
|
|
a \href{https://itch.io/jam/spoopy-jam-7-heckraiser}{local in-person game jam event}
|
|
\end{itemize}
|
|
|
|
\subsection{Vertex shader optimization part 1: ``MOV'' elimination}
|
|
|
|
After talking about it in-person, I decided to try to golf my original
|
|
15-instruction
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/c8ae311e60/drm/cube_rotate.vs.asm}{cube\_rotate.vs.asm} vertex shader.
|
|
|
|
The first opportunity for optimization is in the first two instructions of:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_const_move.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
The \texttt{VE\_ADD} (being used here as a ``MOV'' instruction) is needed
|
|
because there is only a single 128-bit read port into \texttt{const} memory, so
|
|
a multiply-add like this is illegal:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_const_move_illegal.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move_illegal.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
I observed that because I never need to reference the last two constants in the
|
|
same instruction that references the first two constants, if I rearrange the
|
|
ordering of the constants to:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_const_move_rearrange.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move_rearrange.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
I can then rewrite the multiply-add instructions as:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_const_move_rearrange_mad.vs.asm}{\includegraphics{verbatim/output/cube_rotate_const_move_rearrange_mad.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
\subsection{Vertex shader optimization part 2: ``dual math'' instructions}
|
|
|
|
I spent an entire day rewriting large portions of the vertex shader assembler to
|
|
add support for ``dual math'' instructions.
|
|
|
|
The original
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/c8ae311e60/drm/cube_rotate.vs.asm}{cube\_rotate.vs.asm}
|
|
contains this sequence of \texttt{ME\_SIN}/\texttt{ME\_COS} instructions:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_sin_cos.vs.asm}{\includegraphics{verbatim/output/cube_rotate_sin_cos.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
The \texttt{temp[3].x} and \texttt{temp[3].y} results are needed immediately,
|
|
but \texttt{temp[3].z} and \texttt{temp[3].w} are not needed until after the
|
|
first pair of \texttt{VE\_MUL}/\texttt{VE\_MAD} operations.
|
|
|
|
The dual math instruction mode replaces the 3rd \texttt{VE\_} instruction operand
|
|
with any \texttt{ME\_} operation, so it is only usable with 2-operand
|
|
\texttt{VE\_} instructions like \texttt{VE\_MUL}.
|
|
|
|
The dual math encoding also has several restrictions (it only has \nicefrac{1}{4}th the
|
|
control word bits compared to a normal \texttt{ME\_} instruction). A notable
|
|
restriction is that it must write to \texttt{alt\_temp}.
|
|
|
|
Unlike the fancy things that can be done with fragment shader
|
|
operands/sources/swizzles, a single vertex shader operand can also only read
|
|
from a single 128-bit register, so this means to be able to continue to access
|
|
\texttt{temp[3].zw} as a vector, both \texttt{z} and \texttt{w} must now be
|
|
stored in \texttt{alt\_temp}, even if only one of them was written by a ``dual
|
|
math'' instruction.
|
|
|
|
The change (and my newly-implemented dual math syntax) is:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_dual_math.vs.asm}{\includegraphics{verbatim/output/cube_rotate_dual_math.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
Where the dual math instruction:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/cube_rotate_dual_math_single_instruction.vs.asm}{\includegraphics{verbatim/output/cube_rotate_dual_math_single_instruction.vs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
Is encoded by the assembler as a single instruction and is executed by the vertex
|
|
shader unit in a single clock cycle.
|
|
|
|
The final
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/c8ae311e60/drm/cube_rotate_optimize.vs.asm}{cube\_rotate\_optimize.vs.asm}
|
|
was reduced from 15 instructions to 13 instructions (compared
|
|
to Mesa's R500 vertex shader compiler's 27 instructions).
|
|
|
|
\section{Progress: 29 Oct 2025}
|
|
|
|
From 27 Oct 2025 to 29 Oct 2025, I achieved the following (roughly in chronological order):
|
|
|
|
\begin{itemize}
|
|
\item I implemented support for \href{https://git.idk.st/bilbo/r500/commit/9aecbbfc6f297ea71c72f4c4fba1b8107be95ca1}{``multiple render targets''} in the fragment shader assembler
|
|
\item I wrote a \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/texture_blur_horizontal.fs.asm}{gaussian blur fragment shader}
|
|
\item I made a demo that draws \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/pumpkin_man.c#L963}{multiple 3D ``objects''} where each object's UV coordinates sample a \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/pumpkin_man.c#L1029-L1069}{different} \href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/drm/pumpkin_man.c#L314}{texture}
|
|
\item I did several experiments related to R500's Z-buffer implementation
|
|
\end{itemize}
|
|
|
|
\subsection{Z-buffer experiments}
|
|
\label{sec:z-buffer-experiments}
|
|
Though I produced a ``properly'' Z-buffered 3D cube demo previously, I felt I
|
|
did not fully understand the relationship between Z coordinates, W coordinates,
|
|
viewport transformations, and the actual values that are written to the
|
|
Z-buffer. At some point, I'd like to write fragment shaders that sample the
|
|
Z-buffer, so I feel I need to understand this more rigorously.
|
|
|
|
For comparison, Sega Dreamcast stores 32-bit floating-point values in the
|
|
``depth accumulation buffer''. This effectively means that any Z coordinates can
|
|
be stored in the depth accumulation buffer without scaling or range
|
|
remapping. I've made several
|
|
\href{https://az1.idk.st/public/20kdm2-demo.mp4}{moderately fancy} Dreamcast
|
|
demos that happily store arbitrary ``view space'' Z values in the depth
|
|
accumulation buffer without any visible depth aliasing/artifacts.
|
|
|
|
In contrast, the Radeon R500 does not have a 32-bit floating point Z-buffer
|
|
format. Instead, R500 supports (\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf}, page 283,
|
|
\texttt{ZB\_FORMAT}):
|
|
|
|
\begin{itemize}
|
|
\item 16-bit integer Z
|
|
\item 16-bit floating point
|
|
\item 24-bit integer Z with 8-bit stencil
|
|
\end{itemize}
|
|
|
|
The third option, with the most bits, clearly ought to give the most
|
|
precision--with the caveat that the Z values that are written to the Z-buffer
|
|
should be scaled to be uniformly distributed across the range of 24-bit integers.
|
|
|
|
I performed several tests with variations of
|
|
\href{https://git.idk.st/bilbo/r500/src/branch/main/drm/zbuffer_test.c}{zbuffer\_test.c}. The
|
|
general strategy was:
|
|
|
|
\begin{itemize}
|
|
\item Define some contrived/illustrative 3D scene
|
|
\item Manipulate the scale/range of Z and W values
|
|
\item Observe the state of the Z-buffer after rendering
|
|
\end{itemize}
|
|
|
|
The first scene I chose was of a tilted plane that is non-coplanar with the view
|
|
space XY plane, as in:
|
|
|
|
\begin{figure}
|
|
\href{images/plane_scene.png}{\includegraphics{images/plane_scene.png}}
|
|
\caption*{Blender screenshot, ``plane scene''}
|
|
\end{figure}
|
|
|
|
Where the grey plane is the object that is to be rendered, the yellow lines
|
|
represent a ``camera'' from which the plane is to be viewed, and the blue line
|
|
represents the view/clip-space Z axis.
|
|
|
|
To view the content of the Z buffer, I wrote a
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/18b7a593bd/tools/zbuf_decode.py}{simple script}
|
|
to convert the 24-bit integer Z-buffer to 16-bit
|
|
\href{https://en.wikipedia.org/wiki/Netpbm}{PGM},
|
|
so that it can be easily viewed in an image editor. This tool also shows the
|
|
minimum and maximum values found in the Z-buffer, intended to help verify that
|
|
the entire numeric range of the Z-buffer is being used.
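
A minimal sketch of this kind of conversion is shown below. It is not the
actual \texttt{zbuf\_decode.py}: the function name is hypothetical, and it
assumes the depth value occupies the low 24 bits of each Z-buffer word (the
real memory layout may differ).

\begin{minted}{python}
import struct

def z24_to_pgm16(words, width, height):
    """Return (pgm_bytes, z_min, z_max) for a list of Z-buffer words."""
    assert len(words) == width * height
    # Assumption: depth is in the low 24 bits of each word.
    z_values = [w & 0xFFFFFF for w in words]
    # Drop the low 8 bits so that each sample fits in 16 bits.
    samples = [z >> 8 for z in z_values]
    header = f"P5\n{width} {height}\n65535\n".encode("ascii")
    # Binary PGM stores 16-bit samples most-significant byte first.
    body = struct.pack(f">{len(samples)}H", *samples)
    return header + body, min(z_values), max(z_values)
\end{minted}

Reporting the minimum and maximum from the original 24-bit values, rather than
from the truncated 16-bit samples, keeps the range check exact.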
|
|
|
|
While I expected to see the (orthographic, directly facing the camera) plane
|
|
drawn on the Z-buffer as a smooth gradient such as:
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_gradient.png}{\includegraphics{images/z_buffer_gradient.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_gradient.png}}
|
|
\end{figure}
|
|
|
|
Several of my tests displayed numeric aliasing, overflows, underflows, etc.:
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_overflow.png}{\includegraphics{images/z_buffer_overflow.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_overflow.png}}
|
|
\end{figure}
|
|
|
|
Of particular interest to me was to verify the behavior of the
|
|
\texttt{DX\_CLIP\_SPACE\_DEF} bit
|
|
(\href{doc/R5xx_Acceleration_v1.5.pdf}{R5xx\_Acceleration\_v1.5.pdf}, page
|
|
255--this is also the only place in the entire manual where ``non-user'' clip
|
|
planes are even defined), and to understand the order of pipeline operations.
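
For reference, the OpenGL and Direct3D clip-space conventions differ in where
the near Z clip plane sits: OpenGL keeps $-w_{c} \le z_{c} \le w_{c}$, while
Direct3D keeps $0 \le z_{c} \le w_{c}$. The sketch below assumes (without
confirmation from the manual text quoted in this article) that
\texttt{DX\_CLIP\_SPACE\_DEF} selects between these two conventions; the
function name is hypothetical.

\begin{minted}{python}
def inside_z_clip(z: float, w: float, dx_clip_space: bool) -> bool:
    """Test a clip-space Z coordinate against the near/far clip planes."""
    # Direct3D convention: near plane at z = 0.
    # OpenGL convention:   near plane at z = -w.
    near = 0.0 if dx_clip_space else -w
    return near <= z <= w
\end{minted}

A vertex with $z_{c} = -0.5$ and $w_{c} = 1$ is therefore inside the OpenGL
clip volume but outside the Direct3D one.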
|
|
|
|
I played with moving the plane around, to observe clipping behavior (here the
|
|
lower half of the scene was clipped due to intersecting the Z=+1.0 clip plane):
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_clipped.png}{\includegraphics{images/z_buffer_clipped.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_clipped.png}\\
|
|
(also simultaneously showing overflow/underflow artifacts)}
|
|
\end{figure}
|
|
|
|
Thinking at this point that I nearly understood most of the pieces, I then
|
|
re-enabled XY perspective division:
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_perspective.png}{\includegraphics{images/z_buffer_perspective.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_perspective.png}}
|
|
\end{figure}
|
|
|
|
The above image was not quite what I wanted: I noticed the range of the Z buffer
|
|
values was roughly between \texttt{0} and \texttt{8388607}, but what I really
|
|
wanted was \texttt{0} to \texttt{16777215}. Adjusting scale again produced this
|
|
Z-buffer:
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_perspective_scale.png}{\includegraphics{images/z_buffer_perspective_scale.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_perspective\_scale.png}}
|
|
\end{figure}
|
|
|
|
Up to this point, I was using \texttt{ZFUNC=GREATER} with a Z-buffer cleared
|
|
with an initial depth of zero, where all Z values are negative numbers.
|
|
|
|
I decided it might be more intuitive to use a Z-buffer that is cleared with an
|
|
initial depth of one, using \texttt{ZFUNC=LESS} instead where all Z values are
|
|
positive numbers.
|
|
|
|
With these adjustments, I captured a Z-buffer from the earlier cube demo:
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_cube.png}{\includegraphics{images/z_buffer_cube.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_cube.png}}
|
|
\end{figure}
|
|
|
|
This was still not quite ``correct'', because the minimum depth of the cube is
|
|
being drawn as \textasciitilde{}\texttt{2763306} (\textasciitilde{}0.16), but I expected
|
|
something closer to zero.
|
|
|
|
Adjusting my range/scale arithmetic again produced this image:
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_cube_range.png}{\includegraphics{images/z_buffer_cube_range.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_cube\_range.png}}
|
|
\end{figure}
|
|
|
|
The minimum Z value now appears to be closer to zero, but the ``back'' faces of
|
|
the cube (and maximum Z values) are not visible. Without changing any
|
|
scale/range constants, inverting \texttt{ZFUNC} and using a zero-initialized
|
|
Z-buffer produced this image of the back faces of the cube:
|
|
|
|
\begin{figure}
|
|
\href{images/z_buffer_cube_range_back.png}{\includegraphics{images/z_buffer_cube_range_back.png}}
|
|
\caption*{R500 framebuffer capture, \texttt{z\_buffer\_cube\_range\_back.png}}
|
|
\end{figure}
|
|
|
|
Indeed, the maximum Z value is close to \textasciitilde{}\texttt{16777215}
|
|
(\textasciitilde{}1.0), as intended. I feel at this point I have a better intuition
|
|
for using integer Z-buffers. The pipeline (and relevant registers) appears to be
|
|
something like this:
|
|
|
|
\begin{figure}
|
|
\includegraphics{diagrams/z_operations.svg}
|
|
\caption*{R500 Z transform pipeline (simplified)}
|
|
\end{figure}
|
|
|
|
Prior to these experiments, I was not aware \texttt{SU\_DEPTH\_SCALE} is the
|
|
thing directly responsible for scaling floating point Z values to the integer Z
|
|
values stored in the depth buffer.
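
As a rough model of that scaling step, the sketch below multiplies a floating
point Z value by a scale factor and rounds to an integer. It assumes a scale of
$2^{24}-1$ for a 24-bit integer Z-buffer; the function name is hypothetical,
and the hardware's rounding and out-of-range behavior are deliberately not
modeled.

\begin{minted}{python}
def depth_to_z24(z: float, scale: float = 16777215.0) -> int:
    """Scale a floating point depth to an integer Z-buffer value."""
    # No clamping or wrapping is applied: what the hardware does with
    # values outside [0, 2**24 - 1] is not modeled in this sketch.
    return round(z * scale)
\end{minted}

With the assumed default scale, \texttt{depth\_to\_z24(0.0)} gives
\texttt{0} and \texttt{depth\_to\_z24(1.0)} gives \texttt{16777215}.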
|
|
|
|
In general, the hardware perspective divide, viewport transform, clipping, and
|
|
setup units are absolutely fascinating.
|
|
|
|
\subsection{3D perspective}
|
|
|
|
Despite making many 3D demos in the past, I feel that every time I want to
|
|
``draw something 3D'' on a new platform, I need to re-relearn 3D/perspective
|
|
transformations (perhaps because I never truly \textit{learned} anything).
|
|
|
|
In many OpenGL articles/tutorials/books the
|
|
\href{https://learnopengl.com/Getting-started/Coordinate-Systems}{standard}
|
|
\href{https://ogldev.org/www/tutorial12/tutorial12.html}{formula} for
|
|
\href{https://songho.ca/opengl/gl_projectionmatrix.html}{explaining}
|
|
\href{https://www.scratchapixel.com/lessons/3d-basic-rendering/perspective-and-orthographic-projection-matrix/opengl-perspective-projection-matrix.html}{perspective}
|
|
\href{https://learnwebgl.brown37.net/08_projections/projections_perspective.html}{projection}
|
|
appears to be:
|
|
|
|
\begin{itemize}
|
|
\item Begin with an overly-academic explanation of perspective in terms of camera optics and trigonometry
|
|
\item Do not implement or demonstrate any of the systems or mathematics
|
|
described in the preceding pages of explanations; instead abruptly hide all
|
|
magic behind \texttt{glm::perspective}
|
|
\item Refuse to explain or clarify further
|
|
\item Continue for the next 30 chapters/articles without ever revisiting focal
|
|
length, view frustums, depth of field, etc. again
|
|
\end{itemize}
|
|
|
|
It is sufficient to instead rationalize/implement ``perspective'' as:
|
|
|
|
\begin{quote}
|
|
Perspective is the division of X and Y coordinates by Z, where the coordinate
|
|
$(0, 0, 0)$ is the view origin (and the center of the screen/projection).
|
|
\end{quote}
|
|
|
|
Defining perspective this way also works for OpenGL, with some slight
|
|
adjustment, notably to deal with OpenGL's
|
|
\href{https://registry.khronos.org/OpenGL/specs/gl/glspec20.pdf}{definition} of
|
|
``normalized device coordinates''.
|
|
|
|
I note that (unlike Dreamcast) one can't actually divide by Z on R500 (nor
|
|
OpenGL), both because the VTE doesn't support this, and because the texture
|
|
unit doesn't support this. Of course, I tried it anyway:
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/cube_warped_textures.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_warping.c} \\
|
|
(unrelated to this demo, R500 also interestingly has a dedicated ``disable perspective-correct texture mapping'' bit)}
|
|
\end{figure}
|
|
|
|
Instead, in both cases, the R500 uses the W coordinate for division. This turns
|
|
out to be very convenient, because it means that the ``field of
|
|
view''/perspective scale (W) and the Z-buffer/depth test scale (Z) can be
|
|
adjusted independently.
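
A minimal sketch of the XY part of this division, with a hypothetical function
name and with Z handling intentionally omitted (how Z is scaled, and whether it
is also divided by W, depends on configuration that is not modeled here):

\begin{minted}{python}
def project_xy(x: float, y: float, w: float):
    """Perspective-divide X and Y by the W coordinate."""
    # Multiplying by the reciprocal mirrors how a divide is commonly
    # implemented; the result is the same as x / w and y / w.
    inv_w = 1.0 / w
    return (x * inv_w, y * inv_w)
\end{minted}

Doubling W halves the projected X and Y without touching the Z coordinate at
all, which is the independence described above.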
|
|
|
|
\subsection{3D clipping}
|
|
|
|
Here are several examples of improperly scaled Z values, which are being clipped
|
|
by the setup unit:
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/cube_clipped_far.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader\_optimize\_zscale.c} \\
|
|
(``far'' clip plane intersection)}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/cube_clipped_near.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader\_optimize\_zscale.c} \\
|
|
(``near'' clip plane intersection)}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/cube_clipped_near_opengl.png}
|
|
\caption*{R500 DVI capture, \texttt{texture\_cube\_clear\_zwrite\_vertex\_shader\_optimize\_zscale.c} \\
|
|
(I am curious to learn under what circumstances the OpenGL designers thought\\ $-w_{c} < z_{c} < w_{c}$ was a good idea)}
|
|
\end{figure}
|
|
|
|
\section{Progress: 31 Oct 2025}
|
|
|
|
From 30 Oct 2025 to 31 Oct 2025, I achieved the following (non-chronological):
|
|
|
|
\begin{itemize}
|
|
\item I implemented a \href{https://git.idk.st/bilbo/r500/src/branch/main/drm/matrix_cubesphere_specular.fs.asm}{diffuse/specular lighting fragment shader} in R500 fragment shader assembly
|
|
\item I made vertex shaders that represent coordinate space transformations
|
|
using matrix multiplications rather than ad-hoc arithmetic
|
|
\item While writing demos that pass multiple (interpolated) vectors from the
|
|
vertex shader to the fragment shader, I learned more about \href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular_suzanne.cpp#L444-L512}{``rasterizer instructions''}
|
|
\item I made a demo that uses more than one texture for the entire scene
|
|
(by \href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/pumpkin_man.c#L272-L317}{reconfiguring
|
|
the texture unit for each ``object''})
|
|
\end{itemize}
|
|
|
|
\subsection{Lighting demo}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/suzanne.png}
|
|
\caption*{R500 DVI capture, \texttt{matrix\_cubesphere\_specular\_suzanne.cpp} \\
|
|
(subdivided Suzanne mesh, 15,744 triangles)}
|
|
\end{figure}
|
|
|
|
Despite being a ``simple'' lighting demo, a surprising number of things need to
|
|
happen simultaneously before it becomes possible.
|
|
|
|
Where vertex shaders from previous demos were passed at most a single scalar
|
|
variable for animation/timing, the vertex shader in this demo uses
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular_suzanne.cpp#L301-L326}{10 vectors} as
|
|
input:
|
|
|
|
\begin{itemize}
|
|
\item 4 vectors for a ``local space to clip space'' transformation matrix
|
|
\item 4 vectors for a ``local space to world space'' transformation matrix (used for lighting)
|
|
\item 1 vector for a ``light position'' (in world space coordinates, used for lighting)
|
|
\item 1 vector for a ``view origin'' (in world space coordinates, used for lighting)
|
|
\end{itemize}
|
|
|
|
Additionally, where previous demos passed at most a single vector from the
|
|
vertex shader to the fragment shader (vertex color or texture coordinates), this
|
|
demo passes
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular_suzanne.cpp#L444-L512}{5 vectors}
|
|
from the vertex shader to the fragment shader, all of which are used
|
|
by the lighting calculation:
|
|
|
|
\begin{itemize}
|
|
\item world space position
|
|
\item world space normal
|
|
\item world space light position
|
|
\item world space view origin
|
|
\item uv space texture coordinates
|
|
\end{itemize}
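
To make the role of these vectors concrete, the following is a minimal CPU-side C sketch of a diffuse/specular calculation that consumes the first four of them. It assumes a Blinn-Phong-style model, which is an assumption on my part and may not match the demo's actual fragment shader; \texttt{vec3}, \texttt{light}, and \texttt{shininess} are hypothetical names, and the texture coordinate is omitted.

\begin{minted}{c}
#include <assert.h>
#include <math.h>

typedef struct { float x, y, z; } vec3;

static vec3 v_add(vec3 a, vec3 b) { return (vec3){a.x + b.x, a.y + b.y, a.z + b.z}; }
static vec3 v_sub(vec3 a, vec3 b) { return (vec3){a.x - b.x, a.y - b.y, a.z - b.z}; }
static float v_dot(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static vec3 v_normalize(vec3 a)
{
  float length = sqrtf(v_dot(a, a));
  assert(length > 0.0f); /* zero-length vectors are not handled in this sketch */
  return (vec3){a.x / length, a.y / length, a.z / length};
}

/* Assumed Blinn-Phong-style lighting; all inputs are in world space. */
static void light(vec3 position, vec3 normal,
                  vec3 light_position, vec3 view_origin,
                  float shininess, float *diffuse, float *specular)
{
  vec3 n = v_normalize(normal);
  vec3 to_light = v_normalize(v_sub(light_position, position));
  vec3 to_view = v_normalize(v_sub(view_origin, position));
  vec3 halfway = v_normalize(v_add(to_light, to_view));

  *diffuse = fmaxf(v_dot(n, to_light), 0.0f);
  *specular = powf(fmaxf(v_dot(n, halfway), 0.0f), shininess);
}
\end{minted}

The position and normal vary per fragment, which is why they must be interpolated by the rasterizer rather than passed as constants.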
|
|
|
|
\subsection{Learn algebra by writing fragment shader assembly}
|
|
|
|
Prior to today, I did not know about this transformation/equivalence:
|
|
|
|
\begin{gather*}
|
|
x^{n} = 2^{\left( n\cdot\frac{\log(x)}{\log(2)} \right)} = 2^{\, n\cdot\log_{2}(x)} \qquad (x > 0)
|
|
\end{gather*}
|
|
|
|
While the R500 fragment shader alpha unit does not have a \texttt{POW} operation,
|
|
it does have \href{https://git.idk.st/bilbo/r500/src/commit/f43ac599f9/drm/matrix_cubesphere_specular.fs.asm#L93-L99}{\texttt{EX2} and \texttt{LN2}}
|
|
operations.
|
|
|
|
For example, one could implement $a^{32}$ in R500 fragment shader assembly as:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/pow_fragment_shader.fs.asm}{\includegraphics{verbatim/output/pow_fragment_shader.fs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
This ``arbitrary exponents with arbitrary bases'' pattern is used in the
|
|
lighting demo fragment shader as part of the ``specular intensity'' calculation.
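
The identity is easy to check on the CPU. The following is a minimal C sketch, not the demo's shader: \texttt{pow\_via\_ex2\_ln2} is a hypothetical helper name, and \texttt{exp2f}/\texttt{log2f} from \texttt{math.h} stand in for the \texttt{EX2}/\texttt{LN2} operations (assuming these compute the base-2 exponential and logarithm, as the identity above requires).

\begin{minted}{c}
#include <assert.h>
#include <math.h>

/* Hypothetical helper: computes x^n as 2^(n * log2(x)); only valid for x > 0. */
static float pow_via_ex2_ln2(float x, float n)
{
  assert(x > 0.0f);
  float l = log2f(x); /* stand-in for LN2 */
  float m = n * l;    /* a single multiply */
  return exp2f(m);    /* stand-in for EX2 */
}
\end{minted}

For example, \texttt{pow\_via\_ex2\_ln2(2.0f, 10.0f)} evaluates to 1024 (up to floating-point rounding). The restriction to positive bases is not a problem for a specular term, because the base is typically clamped to be non-negative before exponentiation.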
|
|
|
|
This fragment shader unit feature is very cool, because a software
|
|
implementation of a generalized floating-point \texttt{pow} function is
|
|
extremely
|
|
\href{https://git.musl-libc.org/cgit/musl/tree/src/math/powf.c?id=cb5c057c87240a9534f8e0d9b7ff2560082f6218}{computationally expensive}
|
|
otherwise.
|
|
|
|
\section{Progress: 11 Nov 2025}
|
|
|
|
From 1 Nov 2025 to 11 Nov 2025, I achieved the following:
|
|
|
|
\begin{itemize}
|
|
\item I briefly experimented with \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/matrix_cubesphere_cubemap.cpp#L1081-L1088}{cubemap} \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/matrix_cubesphere_cubemap.cpp#L503}{textures}
|
|
\item I experimented with point primitives and texture coordinate ``stuffing''
|
|
\item I made a demo that generates and uses \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/matrix_cubesphere_tiled.cpp}{macrotiled/microtiled textures}
|
|
\item I created a particle system demo where the \href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/particle_physics.fs.asm}{particle simulation is computed in a fragment shader}
|
|
\item I implemented \href{https://git.idk.st/bilbo/r500/commit/9e281cba583ec4a06e02470310c31cdad6962f64}{support for \texttt{\#include} directives} in my vertex and fragment shader assemblers
|
|
\item I used the new \texttt{\#include} feature to more concisely express an \href{https://git.idk.st/bilbo/r500/commit/fdff78f1ad/main/drm/shadertoy_palette_fractal.fs.asm#L26-L29}{unrolled loop}
|
|
\end{itemize}
|
|
|
|
\subsection{Rewriting GLSL ``shadertoy'' shaders as R500 assembly}
|
|
|
|
I felt \href{https://www.shadertoy.com/view/mtyGWy}{``Shader Art Coding Introduction''}
|
|
would be fun to reimplement as R500 fragment shader assembly: it produces
|
|
an interesting visual effect, despite not being particularly complicated.
|
|
|
|
In general when writing assembly programs, I use a few techniques to improve my
|
|
productivity and accuracy, prior to writing any actual assembly:
|
|
|
|
\subsubsection{Expand/rewrite the GLSL code}
|
|
|
|
In particular, my goal in this step is to make each line of GLSL roughly equal
|
|
to one fragment shader instruction. I started by rewriting each function called
by the \href{https://www.shadertoy.com/view/mtyGWy}{original} code as:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/palette_fractal_functions.fs.glsl}{\includegraphics{verbatim/output/palette_fractal_functions.fs.glsl.pdf}}
|
|
\end{figure}
|
|
|
|
Defining replacements for GLSL's built-in functions is not a good practice for
|
|
writing GLSL code in general. However, the goal in this specific situation is to
|
|
give myself line-by-line hints on the R500 fragment shader assembly that I'll
|
|
eventually need to write.
|
|
|
|
I then rewrote the \texttt{main} function in a similar ``one line of GLSL per
|
|
R500 instruction'' style as:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/palette_fractal_main.fs.glsl}{\includegraphics{verbatim/output/palette_fractal_main.fs.glsl.pdf}}
|
|
\end{figure}
|
|
|
|
I also decided the multiplication by 0.125, where it normally would have
|
|
required a separate multiply-add instruction, was a perfect excuse to
|
|
\href{https://git.idk.st/bilbo/r500/commit/90b486e744c14bb23283218108799186162afaad}{implement assembler support}
|
|
for the ``OMOD'' R500 fragment shader feature/syntax that I previously
|
|
\href{https://git.idk.st/bilbo/r500/commit/8e6e6e9750a33759b51ed73d3e238ebe77ee3f61}{implemented in my fragment shader disassembler}.
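
As a rough model of why this saves an instruction, the following C sketch treats the output modifier as a constant scale applied to an instruction's result before it is written. \texttt{mad\_with\_omod} is a hypothetical name, and the sketch does not attempt to reproduce the exact set of scale factors the hardware offers.

\begin{minted}{c}
#include <assert.h>

/* Hypothetical model: a multiply-add whose result is scaled by an output
   modifier, so the scale does not need an instruction of its own. */
static float mad_with_omod(float a, float b, float c, float omod)
{
  assert(omod > 0.0f); /* this sketch only models positive scale factors */
  return (a * b + c) * omod;
}
\end{minted}

With this model, \texttt{mad\_with\_omod(2.0f, 3.0f, 2.0f, 0.125f)} yields the same value as a multiply-add followed by a separate multiplication by 0.125.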
|
|
|
|
\subsubsection{Assign all temporary variables to registers}
|
|
|
|
Still prior to writing any fragment shader assembly, I then decided where I
|
|
would store each GLSL variable in fragment shader temporary/constant memory:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/palette_fractal_memory.fs.asm}{\includegraphics{verbatim/output/palette_fractal_memory.fs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
I intentionally stored scalar values in the alpha component of each
|
|
vector. Given my current fragment shader assembler syntax, this makes the code
slightly more readable. For example, a scalar multiply-add with the alpha unit
looks like this:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/palette_fractal_alpha_mad.fs.asm}{\includegraphics{verbatim/output/palette_fractal_alpha_mad.fs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
If the \texttt{l} variable were instead stored in the green component, the code
|
|
would be slightly uglier, as in:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/palette_fractal_rgb_mad.fs.asm}{\includegraphics{verbatim/output/palette_fractal_rgb_mad.fs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
\subsubsection{Translate the GLSL to R500 fragment shader assembly, line-by-line}
|
|
|
|
Because the GLSL code was transformed to very closely match the fragment shader
|
|
assembly, this also makes it easy to test the fragment shader output when only a
|
|
fraction of the complete program is translated (e.g: by commenting out chunks of
|
|
the GLSL code to match the current state of the in-progress fragment shader
|
|
assembly translation).
|
|
|
|
The visual appearance of a half-translated variant of this fragment shader is
not intuitive, so this technique greatly improves debuggability. I made at least
two mistakes while translating; both were easy to find at a per-instruction
level by comparing against the visuals of the equivalent GLSL code.
|
|
|
|
\subsubsection{Translate a fixed-length GLSL loop}
|
|
|
|
Although the R500 supports flow control, my fragment shader assembler does not
currently implement loops (or any other type of flow control).
|
|
|
|
R500 fragment shader flow control is also relatively expensive compared to
|
|
``loop unrolling'', particularly in this case where the loop body is only 32
|
|
instructions, and there are only 4 total iterations of the loop body.
|
|
|
|
For this reason, I decided I wanted a concise and generalized way to ``repeat''
|
|
chunks of source code in my fragment shader assembler, without actually
|
|
duplicating the text.
|
|
|
|
To do this, I
|
|
\href{https://git.idk.st/bilbo/r500/commit/9e281cba583ec4a06e02470310c31cdad6962f64}{implemented}
|
|
an ``\texttt{\#include}'' feature in my fragment shader assembler. This is
|
|
conceptually similar to how \texttt{\#include} works in the C programming
|
|
language, though my implementation simply feeds tokens from the included file
|
|
directly from the (nested) lexer to the parser, rather than the much more
|
|
complex procedure used by the C preprocessor.
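
The ``nested lexer'' idea can be sketched in C as follows. This is an illustration under stated assumptions, not my assembler's actual code: the in-memory \texttt{sources} table, the whitespace-only tokenizer, the file names, and the instruction names are all hypothetical.

\begin{minted}{c}
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 64
#define MAX_TOKEN_LENGTH 32

/* Hypothetical in-memory "files"; a real assembler would read from disk. */
struct source { const char *name; const char *text; };

static const struct source sources[] = {
  { "main.fs.asm", "begin #include body.fs.asm #include body.fs.asm end" },
  { "body.fs.asm", "mad add" },
};

struct token_stream {
  char tokens[MAX_TOKENS][MAX_TOKEN_LENGTH];
  size_t count;
};

static const char *find_source(const char *name)
{
  for (size_t i = 0; i < sizeof (sources) / sizeof (sources[0]); i++)
    if (strcmp(sources[i].name, name) == 0)
      return sources[i].text;
  return NULL;
}

/* Copies one whitespace-delimited token to out; returns the position after
   the token, or NULL at the end of the input. */
static const char *next_token(const char *p, char *out)
{
  while (*p != '\0' && isspace((unsigned char)*p))
    p++;
  if (*p == '\0')
    return NULL;
  size_t n = 0;
  while (*p != '\0' && !isspace((unsigned char)*p)) {
    assert(n + 1 < MAX_TOKEN_LENGTH);
    out[n++] = *p++;
  }
  out[n] = '\0';
  return p;
}

/* Lexes the named source, appending tokens to out. An #include directive
   starts a nested lexer whose tokens replace the directive in the stream. */
static void lex_into(const char *name, struct token_stream *out, int depth)
{
  assert(depth < 8); /* guard against a file that includes itself */
  const char *p = find_source(name);
  assert(p != NULL);
  char token[MAX_TOKEN_LENGTH];
  while ((p = next_token(p, token)) != NULL) {
    if (strcmp(token, "#include") == 0) {
      char included[MAX_TOKEN_LENGTH];
      p = next_token(p, included);
      assert(p != NULL); /* #include must be followed by a name */
      lex_into(included, out, depth + 1);
    } else {
      assert(out->count < MAX_TOKENS);
      strcpy(out->tokens[out->count++], token);
    }
  }
}
\end{minted}

Lexing \texttt{main.fs.asm} in this sketch yields the token stream \texttt{begin mad add mad add end}: the consumer of the stream never sees the directive, only the tokens of the included file, once per inclusion.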
|
|
|
|
With this new feature, the translation of the GLSL loop is very simple:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/palette_fractal_loop.fs.asm}{\includegraphics{verbatim/output/palette_fractal_loop.fs.asm.pdf}}
|
|
\end{figure}
|
|
|
|
The full implementation is committed as
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/shadertoy_palette_fractal.fs.asm}{shadertoy\_palette\_fractal.fs.asm} and
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/drm/shadertoy_palette_fractal_loop_inner.fs.asm}{shadertoy\_palette\_fractal\_loop\_inner.fs.asm}.
|
|
|
|
\subsubsection{Demo videos}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/shadertoy_fractal.png}
|
|
\caption*{R500 DVI capture, \texttt{shadertoy\_palette\_fractal.fs.asm}\\(variant)}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/shadertoy_fractal2.png}
|
|
\caption*{R500 DVI capture, \texttt{shadertoy\_palette\_fractal.fs.asm}}
|
|
\end{figure}
|
|
|
|
\subsection{Fragment shader particle simulation}
|
|
|
|
\subsubsection{Using fragment shaders to render non-pixel data}
|
|
|
|
ATI documentation \href{doc/R2VB_programming.pdf}{mentioned} the existence of a
|
|
``Render to Vertex Buffer'' feature.
|
|
|
|
The general idea/revelation is:
|
|
|
|
\begin{itemize}
|
|
\item fragment shader output does not need to be ``pixel data''; it can
  be assigned any desired meaning
|
|
\item by alternating between a pair of buffers, fragment shader output can be
|
|
used as the input for the next invocation of the same fragment shader
|
|
\end{itemize}
|
|
|
|
The state manipulated by the pixel shader is double-buffered, where each
|
|
iteration of the fragment shader uses alternating ``read'' and ``write''
|
|
buffers, as in:
|
|
|
|
\begin{figure}
|
|
\href{diagrams/simplified_particle_data_flow.svg}{\includegraphics{diagrams/simplified_particle_data_flow.svg}}
|
|
\end{figure}
|
|
|
|
On the subsequent iteration of the same computation, state ``b'' would be read
|
|
and state ``a'' would be written.
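
The alternation can be modeled on the CPU with a short C sketch. The names \texttt{update} and \texttt{simulate} and the trivial ``add one'' update rule are hypothetical; on the R500 the inner loop corresponds to one fragment shader invocation per state element.

\begin{minted}{c}
#include <assert.h>
#include <stddef.h>

#define STATE_COUNT 4

/* Hypothetical per-element update, standing in for one fragment shader
   invocation. */
static float update(float previous)
{
  return previous + 1.0f;
}

/* Runs the given number of passes over two buffers, swapping the read and
   write roles after each pass. Returns the index (0 or 1) of the buffer that
   holds the newest state. */
static int simulate(float buffers[2][STATE_COUNT], int iterations)
{
  assert(iterations >= 0);
  int read = 0;
  for (int i = 0; i < iterations; i++) {
    int write = 1 - read;
    for (size_t j = 0; j < STATE_COUNT; j++)
      buffers[write][j] = update(buffers[read][j]);
    read = write; /* this pass's output is the next pass's input */
  }
  return read;
}
\end{minted}

A pass never reads from the buffer it is writing, so the order in which elements are processed within a pass does not affect the result.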
|
|
|
|
For all prior fragment shader demos, I used the 32-bit \texttt{C4\_8} surface
|
|
format:
|
|
|
|
\begin{figure}
|
|
\href{diagrams/c4_8.pdf}{\includegraphics{diagrams/c4_8.pdf}}
|
|
\end{figure}
|
|
|
|
Where 8-bit unsigned integer representations of blue, green, red, and alpha
|
|
could be stored in C0, C1, C2, and C3 respectively (or any other arbitrary
|
|
color component ordering).
|
|
|
|
R500 also supports a 128-bit \texttt{C4\_32\_FP} surface format:
|
|
|
|
\begin{figure}
|
|
\href{diagrams/c4_32_fp.pdf}{\includegraphics{diagrams/c4_32_fp.pdf}}
|
|
\end{figure}
|
|
|
|
Where each component contains a 32-bit floating point value. Compared to 8-bit
|
|
integers, this increase in precision makes the format more useful for
|
|
generalized computation.
|
|
|
|
R500 conveniently also has an equivalent 128-bit per texel, 32-bit floating
|
|
point per component, 4-component texture format.
|
|
|
|
\subsubsection{Particle simulation data model}
|
|
|
|
I decided a minimal but still ``mildly interesting'' particle system would need
|
|
at least the following state:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/particle_system_data_model.c}{\includegraphics{verbatim/output/particle_system_data_model.c.pdf}}
|
|
\end{figure}
|
|
|
|
\texttt{age} is used to both ``reset'' the particle after some time (allowing
|
|
the simulation to repeat indefinitely) and to give the particles non-uniform
|
|
reset timing. \texttt{random} is used to further make the behavior of each
|
|
particle less uniform. At the start of the particle simulation, all values are
|
|
randomly initialized.
|
|
|
|
This data model requires 8 components in total, which is more than the 4
components provided by either the pixel shader output surface format or the
texture sampler texel format. However:
|
|
|
|
\begin{itemize}
|
|
\item R500 fragment shaders can have up to 4 independent render targets
|
|
\item R500 fragment shaders can sample from up to 16 independent textures
|
|
\end{itemize}
|
|
|
|
Following this model, it makes sense to break up the data structure like this:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/particle_system_data_model_split.c}{\includegraphics{verbatim/output/particle_system_data_model_split.c.pdf}}
|
|
\end{figure}
|
|
|
|
Where each fragment samples from two separate texture buffers, and has two
|
|
separate render targets as output:
|
|
|
|
\begin{figure}
|
|
\href{diagrams/simplified_particle_data_flow_split.svg}{\includegraphics{diagrams/simplified_particle_data_flow_split.svg}}
|
|
\end{figure}
|
|
|
|
\subsubsection{Drawing particles}
|
|
|
|
I decided to draw particles using the R500's ``quad list'' primitive. In a
|
|
non-fragment-shader-computed version of my particle simulation demo, I sent the
|
|
particle position as a vertex shader constant, as in:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/particle_system_cpu.cpp}{\includegraphics{verbatim/output/particle_system_cpu.cpp.pdf}}
|
|
\end{figure}
|
|
|
|
The vertex shader is then able to calculate the quad vertex positions using a
|
|
vertex shader program that is equivalent to this GLSL code:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/particle_system_position.glsl}{\includegraphics{verbatim/output/particle_system_position.glsl.pdf}}
|
|
\end{figure}
|
|
|
|
This works reasonably well for small particle system demos where particle
|
|
position is calculated on the CPU. However, the goal is to compute (much larger)
|
|
particle system positions via the pixel shader, and it would be highly preferred
|
|
that the particle system state never leaves R500 VRAM. In the latter case, the
|
|
``combine quad position coordinates with particle position coordinates via
|
|
vertex shader constants'' approach does not work for several reasons:
|
|
|
|
\begin{itemize}
|
|
\item R500 constant memory has 256 vectors; I'd like to make particle systems
|
|
with at least 100,000 particles.
|
|
\item The R500 pixel shader is not able to write to vertex shader constant
|
|
memory, it can only write to texture memory.
|
|
\item The \texttt{radeon} Linux kernel module generates a segmentation fault
  in kernel space when given an indirect buffer larger than roughly 2MB (a Linux
  kernel bug, not an R500 hardware limitation). Including the overhead of
  multiple \texttt{3D\_DRAW\_IMMD} commands and vertex constant transfers,
  100,000 particles easily require more than 2MB of indirect buffer.
|
|
\end{itemize}
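
As a rough check of the last point, taking ``2MB'' to mean $2 \cdot 2^{20}$ bytes (an assumption; the exact threshold is only known approximately), the indirect buffer budget per particle is:

\begin{gather*}
\frac{2 \cdot 2^{20}\ \text{bytes}}{100{,}000\ \text{particles}} \approx 20.97\ \text{bytes per particle} \approx 5.2\ \text{32-bit words per particle}
\end{gather*}

A single 4-component vertex shader constant already occupies four 32-bit words, before counting any draw command overhead or quad vertex data.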
|
|
|
|
The only remaining option is to store particle position coordinates as a vertex
|
|
buffer. However, despite the R500 vertex fetcher's generous flexibility, the
particle state buffer can't be used directly as that vertex buffer: I am drawing
quads, and the vertex shader only operates on individual vertices.
|
|
|
|
For example, if a particle is at position \texttt{(4.0, 5.0, 6.0)}, the data
|
|
that needs to be sent to the vertex shader should be:
|
|
|
|
\begin{figure}
|
|
\href{verbatim/particle_position_vertex_shader_example.c}{\includegraphics{verbatim/output/particle_position_vertex_shader_example.c.pdf}}
|
|
\end{figure}
|
|
|
|
Or, \textit{expressed as a texture}, the desired transformation is:
|
|
|
|
\begin{figure}
|
|
\href{diagrams/texture_grid.pdf}{\includegraphics{diagrams/texture_grid.pdf}}
|
|
\end{figure}
|
|
|
|
While the R500 pixel shader unit can't itself perform this transformation
|
|
directly, the transformation can indeed be achieved by ``scaling'' the particle
|
|
state via the R500 setup engine and fragment interpolators using point texture
|
|
sampling.
|
|
|
|
Doing this via point texture sampling is absolutely critical, because a linear
|
|
interpolation between the state of two adjacent-in-memory particles is a
|
|
completely meaningless operation in this context.
|
|
|
|
This is implemented in
|
|
\href{https://git.idk.st/bilbo/r500/src/branch/main/src/particle_oriented_animated_quad_vbuf_pixel_shader.cpp#L583}{\texttt{\_copy\_to\_vertexbuffer}}
|
|
as simply ``rendering'' the particle positions into a viewport that is 4x wider
|
|
than the width of the original particle state texture.
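
The effect of this pass can be modeled on the CPU as a nearest-neighbor horizontal upscale. The following C sketch is only a model of the result, not the R500 command sequence; \texttt{texel}, \texttt{expand\_for\_quads}, and \texttt{QUAD\_VERTICES} are hypothetical names.

\begin{minted}{c}
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } texel;

#define QUAD_VERTICES 4

/* Nearest-neighbor ("point sampled") horizontal upscale: each source texel
   is copied, unmodified, to QUAD_VERTICES adjacent destination texels.
   dst must have room for width * QUAD_VERTICES * height texels. */
static void expand_for_quads(const texel *src, size_t width, size_t height,
                             texel *dst)
{
  assert(src != NULL && dst != NULL);
  size_t dst_width = width * QUAD_VERTICES;
  for (size_t y = 0; y < height; y++)
    for (size_t x = 0; x < dst_width; x++)
      dst[y * dst_width + x] = src[y * width + x / QUAD_VERTICES];
}
\end{minted}

Because every destination texel is an exact copy of one source texel, no value in the output is a blend of two different particles, which is the property that linear filtering would violate.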
|
|
|
|
\subsubsection{The complete rendering pipeline}
|
|
|
|
All buffers in the following diagram are entirely stored in R500 texture memory,
|
|
and are never transferred to x86 RAM.
|
|
|
|
\begin{figure}
|
|
\href{diagrams/complete_particle_data_flow.svg}{\includegraphics{diagrams/complete_particle_data_flow.svg}}
|
|
\end{figure}
|
|
|
|
The full rendering pipeline implementation is committed as
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/particle_oriented_animated_quad_vbuf_pixel_shader.cpp}{particle\_oriented\_animated\_quad\_vbuf\_pixel\_shader.cpp}.
|
|
|
|
The full particle simulation pixel shader implementation is committed as
|
|
\href{https://git.idk.st/bilbo/r500/src/commit/fdff78f1ad/src/particle_physics.fs.asm}{particle\_physics.fs.asm}.
|
|
|
|
\subsubsection{Demo video}
|
|
|
|
Speed comparison of my test system's Pentium 4 CPU and R500 pixel shader
|
|
simulating the same particle system (131,072 particles):
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/cpu_particle_simulation.png}
|
|
\caption*{R500 DVI capture, \texttt{particle\_oriented\_animated\_quad\_vbuf.cpp}\\(CPU-generated particle system)}
|
|
\end{figure}
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/pixel_shader_particle_simulation.png}
|
|
\caption*{R500 DVI capture, \texttt{particle\_oriented\_animated\_quad\_vbuf\_pixel\_shader.cpp}\\(pixel shader-generated particle system)}
|
|
\end{figure}
|
|
|
|
A more colorful variant of the same particle system demo (65,536 particles):
|
|
|
|
\begin{figure}
|
|
\includegraphics{videos/pixel_shader_particle_simulation_color.png}
|
|
\caption*{R500 DVI capture, \texttt{particle\_oriented\_animated\_quad\_vbuf\_pixel\_shader.cpp}\\(pixel shader-generated particle system)}
|
|
\end{figure}
|
|
|
|
It is exciting for me to realize that this ``perform generalized computations
|
|
via R500 pixel shaders'' technique has myriad other possible applications.
|
|
|
|
\end{document}
|