SUBLEQ Processor + SRAM
For the integrated electronics (EE330) final project, my partner Jakub Hladik and I built a “SUbtract A from B and Branch to C if LEss than or EQual to zero” (SUBLEQ) processor. It is a one-instruction set computer (OISC). The idea came from this Wikipedia article describing the abstract machine. Mathematically it seemed possible, so we set out to make the world's first layout implementation of this simple processor. The processor works on the basis of bit manipulation. In the abstract machine, each memory location can store an arbitrary integer, and the instructions themselves reside in memory as a sequence of integers.

SRAM: 256x8 Bit
For the build, my role was to create the 256 x 8 bit SRAM. It stores the initial program before execution and the result afterward. My design used a 6-transistor (6T) cell that can read and write one bit. For the layout, I built the cell matrix with row and column decoders to access each byte of memory. The matrix consisted of 4 columns of 64 rows. My partner's job was to create the processor that interacts with my memory. He synthesized his layout from VHDL code that models the operation of the one-instruction machine. My layout was done by hand to condense the design as much as possible. I developed a highly optimized 6T cell and a binary method to connect all 2048 decoding wires effectively. We were working in a 0.5 um process and had an area of 40 x 40 lambda to work with for manufacturing through MOSIS.
6T Cell
This component was the heart of my SRAM since it is what stores the bits. I spent a lot of time in the design stage minimizing its area and optimizing its performance. This was crucial since any mistake in this step would be magnified when the cell was duplicated 2048 times. I looked at a few different designs before going forward: the 5T (single access), 6T (double access), and 12T (latch) cells. In the end, I chose the 6T cell because its symmetric design makes it easy to connect and compatible with the surrounding circuitry.
The cell stores 1 bit of data, represented as a 1 or 0 depending on the voltage level. Two NMOS transistors operate as access gates. When they are turned on, the bit lines have a path to the internal cross-coupled inverters. The cross-coupled inverters in the center of the cell are simply two NOT gates feeding back into one another. They hold the state for as long as power is supplied, because each inverter continuously re-inverts the stored bit. For transistor sizing, I made the PMOS pull-up transistors in the cross-coupled inverters the weakest so that a write operation can overpower them and switch the state. I made the NMOS pull-down transistors in the cross-coupled inverters 4x larger than the PMOS to give them enough drive strength to alter the bit lines. Lastly, I made the two NMOS access transistors twice as strong as the PMOS, giving them a medium strength. I arrived at this sizing through preliminary testing, and it ensured both read stability and writability moving forward.
The layout was constructed so that the VDD rail and word line run laterally, while the VSS rail, bit line, and not-bit line run vertically. This made the cells easy to stack and connect both horizontally and vertically.
Write Drivers
This circuit provides the read and write interface to the memory by driving the bit lines to their desired state and reading values off the bit lines. It has the largest transistors so that it can change the state of an entire bit line quickly. When the write enable signal is HIGH, whatever is in DATA is driven onto the bit lines and its inverse is driven onto the not-bit lines. During this time the desired word line is driven high to turn on the gates of the access transistors. This allows us to write the bits into the desired address. The layout was created so that the lateral beams connect with those of the adjacent drivers. The circuit fits under the lowest 6T cell and is 8 columns wide.
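The write path described above can be sketched as a small behavioral model. This is only an illustration, not the actual circuit: the function names `drive_bit_lines` and `write_word`, and the representation of the memory as a Python list of 256 bytes, are assumptions made for the example.

```python
def drive_bit_lines(write_enable, data_byte):
    """Return 8 (bit_line, not_bit_line) pairs, or None when the driver is off."""
    if not write_enable:
        return None  # driver idle: the bit lines are not forced by DATA
    pairs = []
    for bit in range(8):
        d = (data_byte >> bit) & 1
        pairs.append((d, 1 - d))  # DATA on the bit line, its inverse on the not-bit line
    return pairs


def write_word(memory, address, write_enable, data_byte):
    """Model a write: the word line for `address` is high while the driver is on."""
    pairs = drive_bit_lines(write_enable, data_byte)
    if pairs is None:
        return  # write enable LOW: the stored byte is left unchanged
    # Each selected cell takes the value present on its bit line.
    memory[address] = sum(bl << bit for bit, (bl, _) in enumerate(pairs))


memory = [0] * 256
write_word(memory, 5, 1, 0xA5)   # stores 0xA5 at address 5
write_word(memory, 5, 0, 0xFF)   # ignored because write enable is LOW
```

The model ignores timing and transistor strength entirely; it only captures which logic levels end up on the bit lines and in the addressed byte.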
Precharge
This circuitry conditions the bit lines while the clock is low. Conditioning the bit lines means charging them up to VDD so that you can sense the current on one of the bit lines and determine the value stored in the cell. If current flows on a bit line, you know there was a 0 on that side of the cross-coupled inverters, because the 1 on the bit line is being shorted through the access transistor and the NMOS of the inverter to ground. This layout was placed on top of the 6T cells and is 8 columns wide. The lateral beams connect to adjacent cells and the vertical connections go to the 6T matrix bit lines underneath. Below is some testing I did with different precharge levels. The circuit was functional at all levels but performed best with a precharge of VDD, so that was used in the final design.
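The read reasoning above can be written down as a tiny logic-level sketch. It is a simplification that treats "current flows" as a boolean and ignores analog behavior; the function name `read_cell` and the convention that `q` is the value stored on the bit-line side are assumptions for illustration.

```python
def read_cell(q):
    """Logic-level model of reading one 6T cell.

    q is the bit stored on the bit-line side of the cross-coupled inverters;
    the not-bit-line side holds the inverse (1 - q).
    """
    # Clock low: both lines are precharged to VDD (logic 1).
    bit_line, not_bit_line = 1, 1

    # Word line high: the side storing 0 pulls its precharged line toward
    # ground through the access transistor and the inverter's NMOS.
    current_on_bit_line = (q == 0)
    current_on_not_bit_line = ((1 - q) == 0)

    # Current on the bit line means a 0 was stored on that side.
    sensed = 0 if current_on_bit_line else 1
    return sensed, current_on_bit_line, current_on_not_bit_line


for stored in (0, 1):
    print(stored, read_cell(stored))
```

Exactly one of the two lines carries current for either stored value, which is what makes the precharged pair usable for sensing.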
Decoder
The last component was an 8-to-256 decoder that turns on the correct word line to access the desired address in memory. The design is a simple 8-input NAND gate followed by an inverter, forming an AND gate that drives the word line. This component accumulated a significant delay but still met our 1 MHz clock specification. It was designed to fit the same vertical pitch as the cells so the two systems integrate easily.
The overwhelming problem at this stage was how to wire the 2048 decoding wires. I spent a lot of time weighing the options: building an automation program, labeling the wires and performing some sort of auto-routing, or, my least favorite, routing by hand. To solve the problem effectively, I designed 8 masks, each patterned with a power-of-2 binary sequence. Each pattern was duplicated and aligned to the final layout. This drastically decreased the amount of work I had to do and reduced the process to 8 easy steps. Routing all 2048 wires took an hour instead of the week it would have taken by hand.
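One plausible reading of the eight binary masks is sketched below. It assumes that each of the 256 row gates takes, for every address bit, either the true or the complemented address line, and that the choice for bit k alternates in blocks of 2^k rows. The names `mask` and `word_lines` are hypothetical, and the real layout masks may have been organized differently.

```python
ADDRESS_BITS = 8
ROWS = 2 ** ADDRESS_BITS  # 256 word lines


def mask(bit):
    """Pattern for one address bit: True where a row taps the true line,
    False where it taps the complemented line. Repeats every 2 * 2**bit rows."""
    period = 2 ** bit
    return [(row // period) % 2 == 1 for row in range(ROWS)]


# Eight masks, one per address bit: 8 * 256 = 2048 connections in total.
masks = [mask(bit) for bit in range(ADDRESS_BITS)]


def word_lines(address):
    """Evaluate every row's 8-input AND (NAND plus inverter) for an address."""
    lines = []
    for row in range(ROWS):
        inputs = []
        for bit in range(ADDRESS_BITS):
            a = (address >> bit) & 1
            # A row tapping the complemented line sees the inverted address bit.
            inputs.append(a == 1 if masks[bit][row] else a == 0)
        lines.append(all(inputs))
    return lines


print(sum(len(m) for m in masks))       # total number of decoder connections
print(word_lines(0x2A).index(True))     # index of the single active word line
```

Under these assumptions, each mask is a single repeating stripe pattern, so placing eight patterns covers every decoder input without routing wires individually.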
Results & Testing
To test the circuit, I created a symbol of the layout and attached voltage supplies to it to simulate the processor inputs. This is shown to the right. I then ran a transient analysis and looked at the output pins to see whether reading and writing worked for the individual addresses. I found that I could read and write all 256 addresses. In the future, it would be nice to use industrial equipment to test reading and writing various patterns into the memory.
Issues & Challenges
- Deciding on a precharge voltage. I ran numerous tests before settling on VDD.
- Die space. It became apparent that this circuit was not going to fit on the MOSIS 3000 lambda die. I found out that they make exceptions that allow 4 of these dies to be put together for more space.
- Comparing the schematic and layout in simulation was difficult. I found out how to simulate just the layout versions by adding the extracted view to the simulation environment. This gave a more realistic simulation of how the circuit would behave if fabricated.
- The last difficulty was figuring out how to wire all 2048 lines to the decoders. I came up with the powers-of-2 mask idea, which saved lots of time and prevented errors.
SUBLEQ Processor

Architecture Design
Jakub's role was to design the processor layout. He designed a very simple multi-cycle architecture consisting of four registers, a program counter, a subtractor, and control logic. An abstracted diagram can be seen below.
During each cycle, one memory read or write operation occurs. The principle of operation can be split into six cycles as follows:
- RegA = Mem[PC], PC = PC + 1
- RegA = Mem[RegA]
- RegBAddr = Mem[PC], PC = PC + 1
- RegB = Mem[RegBAddr]
- RegB = RegB - RegA, RegC = Mem[PC], PC = PC + 1
- Mem[RegBAddr] = RegB; if RegB <= 0 then PC = RegC
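The six cycles above can be mirrored one-to-one in a short sketch. It assumes a 256-entry, 8-bit memory with wraparound and a two's complement interpretation of the result for the "less than or equal to zero" test; the source does not state the number format, so treat that as an assumption. The names `step` and `to_signed` are made up for this example.

```python
def to_signed(value):
    """Interpret an 8-bit value as two's complement (an assumption)."""
    value &= 0xFF
    return value - 256 if value & 0x80 else value


def step(mem, pc):
    """Execute one SUBLEQ instruction as six memory cycles; return the new PC."""
    # Cycle 1: fetch the address of operand A.
    reg_a = mem[pc]
    pc = (pc + 1) & 0xFF
    # Cycle 2: load the value of A.
    reg_a = mem[reg_a]
    # Cycle 3: fetch the address of operand B.
    reg_b_addr = mem[pc]
    pc = (pc + 1) & 0xFF
    # Cycle 4: load the value of B.
    reg_b = mem[reg_b_addr]
    # Cycle 5: subtract and fetch the branch target C.
    reg_b = (reg_b - reg_a) & 0xFF
    reg_c = mem[pc]
    pc = (pc + 1) & 0xFF
    # Cycle 6: write the result back and branch if it is <= 0.
    mem[reg_b_addr] = reg_b
    if to_signed(reg_b) <= 0:
        pc = reg_c
    return pc


mem = [0] * 256
mem[0:3] = [10, 11, 6]     # instruction: A at address 10, B at address 11, C = 6
mem[10], mem[11] = 3, 5
pc = step(mem, 0)          # 5 - 3 = 2 is positive, so execution falls through to PC = 3
```

Each cycle touches memory exactly once, which matches the one-read-or-write-per-cycle constraint described above.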
The single instruction the processor performs can be transcribed as “subtract value A from B and go to C if the result is less than or equal to zero”. This does not sound like much, but this processor, provided it has enough memory, can solve any algorithmic problem. We can suppress the branch by making the value C the same as the next program counter value, add by subtracting negative numbers, and branch by providing a different value C. A sample test program can be seen below.
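As a worked illustration of these idioms, the sketch below adds two numbers using only subtraction. It is an instruction-level model of the abstract machine with unbounded Python integers, not the 8-bit hardware. The `run` helper, the memory layout, and the convention that a negative branch target halts the machine are all assumptions made for the example.

```python
def run(mem, pc=0, max_steps=1000):
    """Run SUBLEQ instructions until the PC leaves the program or the step limit hits."""
    steps = 0
    while 0 <= pc <= len(mem) - 3 and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]                        # subtract A from B
        pc = c if mem[b] <= 0 else pc + 3       # branch to C if result <= 0
        steps += 1
    return mem


A, B, Z = 9, 10, 11          # data addresses: operands a and b, scratch cell Z
program = [
    A, Z, 3,                 # Z = Z - a = -a; C is the next instruction, so no real branch
    Z, B, 6,                 # b = b - (-a) = b + a; C is again the next instruction
    Z, Z, -1,                # Z = 0 always, so this always branches: jump to -1 (halt)
    7, 5, 0,                 # data: a = 7, b = 5, Z = 0
]

result = run(list(program))
print(result[B])             # 12, i.e. 5 + 7
```

The last instruction shows the unconditional branch: subtracting a cell from itself always yields zero, so the jump to C is always taken.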
Architecture Simulator
To verify the programs Jake wrote for this processor, he wrote a very simple simulator. It takes in a file with the machine language program (initial values of the memory), and simulates the execution of the program step by step.
FPGA Prototype
Jake decided to make an FPGA prototype on the school’s Altera DE2-115. He created a test bench that allowed him to step through the cycles and verify the values going into and out of the memory. He discovered a few timing issues when running the design on a physical FPGA chip, which made him change the type of memory used in the prototype (level-sensitive instead of edge-sensitive).
CMOS Layout
After making final improvements to the FPGA prototype, Jakub moved on to designing a CMOS layout. He used Cadence Encounter to synthesize, place, and route the core. The synthesized schematic is shown below. He used the pin planner functionality of Encounter to specify where the input and output pins should be. He also ran clock synthesis (1 MHz target speed) and clock analysis. The final layout can be seen below.
Results & Testing
The Cadence simulation showed a working simple processor running at 1 MHz, tested by implementing a “for loop”. The processor required 2500 transistors and the SRAM required 20,000 transistors. The processor dissipates 0.2 mW of power. Even though its computing power is limited by having a single instruction, this processor could have many low-power applications where execution time isn't an obstacle.
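To put these figures in perspective, here is a rough back-of-the-envelope calculation. It assumes that every instruction takes exactly the six cycles described in the architecture section and that 0.2 mW is the average power while running; neither is confirmed beyond what is stated above, so the results are estimates only.

```python
CLOCK_HZ = 1_000_000            # 1 MHz clock from the simulation
CYCLES_PER_INSTRUCTION = 6      # assumption: one instruction = six cycles
POWER_W = 0.2e-3                # 0.2 mW, assumed to be average running power

instructions_per_second = CLOCK_HZ / CYCLES_PER_INSTRUCTION
energy_per_instruction_j = POWER_W / instructions_per_second

# Storage cells alone: 256 bytes x 8 bits x 6 transistors per 6T cell.
cell_transistors = 256 * 8 * 6

print(f"{instructions_per_second:.0f} instructions per second")
print(f"{energy_per_instruction_j * 1e9:.2f} nJ per instruction")
print(f"{cell_transistors} transistors in the 6T cell array")
```

Under these assumptions the processor executes roughly 167 thousand instructions per second at about 1.2 nJ each, and the 6T array accounts for 12,288 of the SRAM's roughly 20,000 transistors, with the rest in the decoders, drivers, and precharge circuitry.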
Below is some of the testing that was done to check that the layout implementation operates correctly.
Issues & Challenges
- The OSU library has bugs in the pads that cause the design to fail DRC. After discussions with several experienced Encounter users on campus, we came to the conclusion that the pads are poorly designed or poorly converted for use in Virtuoso.
- Jakub was not able to export a schematic netlist from Encounter into Virtuoso that would successfully pass LVS with my design. He did, however, spend several hours looking for suspicious connections in both the layout and the extracted version.
- Jakub was not able to simulate my extracted design. His simulation would run for a couple of minutes and then crash. He worked with ETG on the issue, but it was never resolved.
- The lack of documentation for this architecture made it especially challenging in the beginning. Jakub started this project early in the semester so he could deal with architectural issues.

subleq_presentation.pdf (File Size: 4079 kb)