A recent ACM article, C Is Not a Low-Level Language, argues that for all of our impressions that C is close to the hardware, it is not actually "low-level". The argument is as follows: C was written for the PDP-11 architecture, and at the time it was a low-level language. As architectures have evolved, C's machine model has diverged from the hardware, which has forced processor designs to add new features to attain good performance with C, such as superscalar execution for instruction-level parallelism and extensive branch prediction.
Processors must also maintain caches to hide memory latency, which requires significant logic to maintain coherence and the illusion that memory is shared between the threads of a process. Furthermore, the compiler is also called upon to find optimization opportunities that may be unsound and that certainly require programmer-years to implement.
The author repeatedly contrasts CPUs with GPUs, while noting that GPUs require very specific problems, or come "at the expense of requiring explicitly parallel programs". If you were not keeping track, a GPU requires thousands of threads to match the CPU's performance. The author calls for: "A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model." That generally sounds like a GPU.
I appreciate the callouts to C's shortcomings, which it certainly has. The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run existing programs fast. And with those programs written either in C or in a language built on C, many programs are forced into particular patterns. I even spent some time in my PhD studies considering a version of this problem: how do you design a new "widget" for the architecture when no programs are written assuming that widgets are available?
All to say, I think C is a low-level language, and while its distance from hardware may be growing, there is nothing else beneath it. This is a gap that needs to be addressed, and by a language that has explicit parallel support.
Friday, September 14, 2018
Saturday, March 7, 2015
Conference Attendance SIGCSE 2015 - Day 2 / 3
I recognize that Day 1 afternoon went "missing". I presented my poster, and that consumed the sum total of my time. While I am happy with all that I achieved with my poster (writing the IRB protocol, independent work, analyzing my teaching, et cetera), it was not selected as a finalist for the Student Research Competition (SRC). Yet I received significant feedback and a number of follow-ons that I will have to try to evaluate the next time(s) I teach. I have been doing an excellent job of networking and speaking with my colleagues, and I have seen several exciting techniques to improve my teaching.
On traveling to conferences: take some time to prepare students. Let them know what to expect. For example, it is okay to miss some paper sessions, and even to return to your room entirely. It is okay to ask questions one-on-one. Find groups where people are being introduced and join in. Student volunteering, while it takes time, also gives you additional individuals whom you will know. Use the people you know to introduce you to others at the conference.
This is just what it sounds like: a Ruby-based framework that enables writing simple unit tests, which are then applied to a full simulation of the executed assembly.
The presenter(s) were not at this poster, but it showed a high-quality interface for seeing the scheduling of threads according to different scheduling policies. The intent here was not to explore races and parallelism, but rather to see how scheduling decisions are made in an OS.
I was not expecting this poster. You are walking along and then see four Raspberry Pis all networked together. Raspberry Pis and HPC?! A small setup, but it is an interesting development that takes advantage of the low-cost Pi and still provides an HPC platform for students.
Plastic parts all worked together to form replicas of Pascal's mechanical calculator. Interesting and student assembled.
Teams of 4 students; the approach is evaluated on courses from three years of the major (sophomore on up). Teams are formed with CATME (particularly using dissimilar GPAs in a group), as well as partner selection (when possible). Students provide peer evaluations after each stage of the project. Significant data collection looks particularly at what students prefer the evaluation policy to be (ranging from 100% of the grade for the group's work to 100% of the grade for the individual's contribution). This question was asked repeatedly throughout the semester, which raises the question of whether student preferences change. More senior students prefer more weight being attributed to the group. The best predictor of the preferred grade split is the point in the course at which the question is surveyed, and effectively as soon as the teams are formed the students prefer to be graded primarily as a group. A follow-on study is looking at experience with team projects, trust in the ability to evaluate individual contributions, and other questions. This is a hopeful data point.
How do faculty become aware of teaching practices, and why do they try them out? 66 participants in CS, including chairs, tenure-track faculty, teaching faculty, and Ph.D. student instructors across 36 institutions. First, the mental model of what an instructor does can differ significantly from what the instructor is actually doing. Second, faculty find out about practices through a variety of approaches, such as self-identifying that there is possible improvement in their teaching. Faculty often trust other faculty like them (researchers trust researchers, lecturers trust lecturers). Third, when adopting a practice, faculty need to evaluate its effectiveness (see also my poster, student feedback, etc.). -- My efforts in this have been having different faculty (my recommendation letter writers) view my lectures / teaching, and thereby giving them demonstrations of different practices.
"We lost the war on cheating" Instead, we have to meet with students such that they are demonstrating their understanding of the code. The requirements of submissions: attribute your sources and understand your submission. Enables students to work together, use all sources, develop interview skills. Enables reuse of assignments. Grading is now 40% correctness / 60% code interview. Rubric for each interview. Students should arrive early and have their laptop ready to present / explain. Students were better able to learn and complete the assignments, as well as feedback for improvement. Students also felt better able to learn the material by being able to collaborate and not constrained by a collaboration policy. There are some stressors, such as TAs having to meet with hundreds of students, as well as their inconsistencies. -- This was perhaps the most exciting new technique that I saw / heard about.
Tuesday, March 18, 2014
Turing Complete x86 Mov
Here is a small work that shows the power (or perhaps horror) of modern assembly languages. I've told people before that an instruction to "subtract and branch if negative" is sufficient for all programming. But never would I have imagined that x86's mov is also sufficient. Until you think about what Intel lets you do:
- Set the instruction pointer
- Compute arithmetic expressions via address calculation
- Compare two values by using their corresponding locations
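To make the third bullet concrete, here is a minimal sketch in C of how two byte-sized values can be compared using only loads and stores into locations indexed by the values themselves (the table name and the byte-sized domain are my assumptions for illustration; the real construction works on full machine words):
#include <stdio.h>

static unsigned char scratch[256];   /* one cell per possible byte value */

/* Returns 1 if x == y, 0 otherwise, using only mov-style loads and stores. */
static int mov_equal(unsigned char x, unsigned char y)
{
    scratch[x] = 0;      /* mov byte [scratch + x], 0 */
    scratch[y] = 1;      /* mov byte [scratch + y], 1 */
    return scratch[x];   /* mov al, byte [scratch + x] -- reads 1 only if x == y */
}

int main(void)
{
    printf("%d %d\n", mov_equal(7, 7), mov_equal(7, 8));   /* prints: 1 0 */
    return 0;
}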
Tuesday, February 4, 2014
Book Review: ARM Assembly Language: Fundamentals and Techniques
Besides reading an average of 5 research papers every week, I also read an average of one book each week. Occasionally those books relate to computers and usually then I'll write about them here. I had realized a couple months ago that I didn't really know anything about ARM processors, besides that they are low power. It seemed remiss to be studying computer architecture and not know one of the modern architectures. Thus I visited my school library and checked out a book.
This is the story of that book, ARM Assembly Language: Fundamentals and Techniques. An interesting book that covered the basics of what I wanted to learn, but its shortcoming was that it assumed an environment different from mine. ARM processors can be found in a greater diversity of devices than, say, x86. Yet I am still thinking about the ARM processor as a drop-in replacement; I look more to devices like Microsoft's Surface or a smartphone, and think about the presence of an OS, etc.
I learned in particular that ARM instructions have bits to make them predicated. And I realized then that conditional branches are really just predicated instructions: if the predicate(s) are true, then take the branch. Just another perspective on instruction sets. Anyway, I look forward to getting a Raspberry Pi, so I can try out some of what I've learned and also get a chance to work through the assembly generated by compilers.
Friday, June 21, 2013
What is Time: Keeping Time in Software
In a prior post, Measuring Performance, I discussed some techniques for how a developer might measure the performance of his or her application. However, sometimes what you want is to measure an interval of time. We can generally agree to do this by taking two measurements: the time before and the time after. Then find the difference, which is the interval of time elapsed. Great! But how exactly are we going to measure time?
A common theme to the questions I answer on stackoverflow is proper measurement of time (see for example, Analysis of CPU caches, Function Pointers versus Inline Functions, or Windows system time with millisecond resolution). A programmer starts with the tools most readily available, usually a call to gettimeofday or GetTickCount (for example). These functions provide time in (roughly) millisecond resolution, like having a stopwatch. For its own timekeeping purposes, an OS sets up hardware to send an interrupt every ~16ms (i.e., 64Hz via the PIT). This interrupt primarily denotes a scheduling quantum, but for little additional overhead, the OS updates its clock. Some applications need finer-grained scheduling and increase the frequency of the interrupt (i.e., decrease the quantum), which increases overhead and power usage. Using these calls, the programmer can measure events that take at least 1 centisecond with reasonable accuracy.
But, you object, I want better accuracy! Let's turn to the HPET then: a hardware device that internally operates at 10+ MHz (i.e., ticks of 100ns or less). Thanks to this device, the system has a microsecond time source. For this resolution, a programmer may use clock_gettime or QueryPerformanceCounter. Both also come with supplemental routines that indicate the resolution of the time returned. For example:
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv)
{
    struct timespec n, o;
    clockid_t t = CLOCK_MONOTONIC;

    clock_getres(t, &n);
    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);

    clock_gettime(t, &n);
    clock_gettime(t, &o);
    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);
    printf("%ld - %ld\n", (long)o.tv_sec, o.tv_nsec);

    return 0;
}
% ./t_test
0 - 999848
13390932 - 798720095
13390932 - 798720095
This time source has a reported resolution of approximately 1ms; however, sometimes the printed "n, o" pair actually shows a one-microsecond difference, which suggests that the true resolution is finer than what clock_getres returns. Thus on my test machine, I could measure time in microseconds.
But, you object, I want better accuracy! So let's move on to the timestamp counter. This counter is per-processor and runs at the processor's frequency (1+ GHz, or <1ns per tick). clock_gettime and QueryPerformanceCounter may use this time source, so we can query it for a nanosecond-accurate time measurement. Done, aren't we?
But, I object, it takes time to do this measurement; perhaps a very small amount of time, but time nonetheless. If we switch the code above to use CLOCK_PROCESS_CPUTIME_ID, the measurements show a 400 - 500ns difference between n and o, even though the reported "resolution" is 1ns. Did the function lie about its resolution? No. Imagine calling NIST or CERN to get the time from their clock. Their clock is accurate to better than 1ns, but it takes time to call them. In the code example, each call is taking about 400ns to make the function call(s), query the timestamp and then convert into the appropriate format. Note that this overhead was also present in the earlier experiments, which is why the earlier output occasionally differed by 1 microsecond, as the 400ns of overhead caused the measurements to fall in different microsecond ticks.
But, you object, I really need to know how many nanoseconds this something takes. There are two approaches to remedy this: first, we could write some code that does the above repeatedly to determine the overhead and then subtract it from the measured difference, or second, we could find a different time source.
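A minimal sketch of the first approach follows; the iteration count and the averaging scheme are my own choices for illustration, not taken from the original experiments:
#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000

/* Estimate the per-call overhead of clock_gettime by timing many
   back-to-back calls and averaging. */
int main(void)
{
    struct timespec start, end, scratch;
    long long elapsed_ns;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
               + (end.tv_nsec - start.tv_nsec);
    printf("~%lld ns per call\n", elapsed_ns / ITERATIONS);

    /* Subtract this estimate from subsequent interval measurements. */
    return 0;
}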
These function calls are querying the timestamp counter, but architects have exposed the counter to programmers for years, so why make the call at all? In x86, it is the instruction RDTSC or RDTSCP, which provides the 64-bit timestamp value. A single instruction is about as small an overhead as we're going to get (although we can also look at register pressure, cycles required, etc.).
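For reference, here is a small sketch of reading the counter directly through compiler intrinsics (this assumes GCC or Clang on x86; the original post does not show its own wrapper):
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc and __rdtscp on GCC / Clang */

int main(void)
{
    unsigned int aux;
    uint64_t start, end;

    start = __rdtsc();        /* plain RDTSC: may reorder with nearby instructions */
    /* ... code being measured ... */
    end = __rdtscp(&aux);     /* RDTSCP: waits for prior instructions to complete */

    printf("%llu cycles\n", (unsigned long long)(end - start));
    return 0;
}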
Hah! You exclaim, now I can know which instruction sequence is fastest, because I can measure CPU cycles. But at this timescale, your program's measurement has left the "classical" architecture model of executing one instruction at a time and is drifting into the reality of many 10s of instructions executing simultaneously. When you toss in your RDTSC, there is no guarantee where in this mix of instructions it will execute and complete (with RDTSCP, you can guarantee that all previous instructions have completed). So it is recommended that even if you are measuring a "small instruction sequence" like some of the cited stackoverflow questions, that you perform thousands of operations and measure the time of this aggregate. Of course, architectural behaviors like branch prediction, cache misses, prefetchers, etc will change over the course of the experiment.
As if this wasn't bad enough, the timestamp counter (including calls to CLOCK_PROCESS_CPUTIME_ID) is specific to each processor. The counter is not global system time, so taking measurements on different processors (say different threads or a thread that migrated) will not necessarily give meaningful results. Many modern OSes try to set the counters so that the measurements are within a small margin of each other. In practice, I have observed deltas from low 100s to 10s of millions, so be warned.
Even worse, older processors really did count CPU cycles as the timestamp, so if the processor slowed down, then a "cycle" meant more nanoseconds. Modern chips have invariant TSC, per Intel architecture manual:
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
As a bonus, all of the clocks can be affected by thermal drift, as their frequency is related to their temperature. Others have investigated this behavior in greater detail, and while it can be accounted for, in practice I've happily ignored it (although I should note in future writing that I have not controlled for this possible measurement bias).
In conclusion, I use RDTSC for most of my measurements and try to control for the clock difference that exists between processors. In my current PhD research, I am still working to try to control and properly account for this difference so that I can measure the behavior that I am interested in. So be warned and be cautious, and be as precise as you need to be.
Wednesday, June 5, 2013
Atomics versus Locks
I've written in the past about using intrinsics to improve performance. And I've discussed using this functionality to build your own synchronization. I'm not alone in this knowledge, and I want to further expand the answer to a stackoverflow question. But sometimes a little knowledge is a dangerous thing. The trouble came from a paper that looked at, among other things, the performance of incrementing a value using locks versus using interlocked compare-and-swap for two threads.
Alright, compare-and-swap (CAS) should be faster, as one can approximate lock acquire and release as this operation. So let's test and see how the results turn out. First, the base code:
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc on GCC / Clang; x86 assumed */

#define MAX 500000000ULL /* 0x1dcd6500; the assembly below compares against MAX - 1 = 0x1dcd64ff */
#define rdtsc __rdtsc

unsigned long long counter = 0;

int main(int argc, char** argv)
{
    unsigned long long start, end;

    start = rdtsc();
    do {
        counter++;
    } while (counter < MAX);
    end = rdtsc();

    printf("%llu\n", (end - start));
    return 0;
}
This version is the "Single Thread" result. The first variation is to make counter volatile, as the variable needs this property when we move to multiple threads. By doing so, the compiler has to read, modify, then write the result to memory on every loop iteration. That is overhead versus keeping the value in a register, as in the assembly that follows:
loop: add $0x1,%rax
cmp $0x1dcd64ff,%rax
jbe loop
And the volatile code adds: mov 0x1429(%rip),%rax and a corresponding write after. Already this is enough to measurably impact performance. Let's move forward and switch to performing our increment using compare-and-swap.
int i = 1;
do {
    unsigned long long new, old = counter;
    if (old < MAX)
    {
        new = old + 1;
        __sync_val_compare_and_swap(&counter, old, new);
    }
    else
        i = 0;
} while (i == 1);
Now, we only want to increment when we haven't reached max. If we have, then terminate. But the assembly is now unpleasant:
loop: mov 0x144b(%rip),%rax # 401a90 <counter>
cmp $0x1dcd64ff,%rax
jbe swap
...
swap: lea 0x1(%rax),%rcx
lock cmpxchg %rcx,0x140b(%rip) # 401a90 <counter>
jmp loop
First, the optimized code has split the loop into two pieces: the check and the increment. A fun exercise would be to convince the compiler to switch the jbe into a ja and combine the pieces back together. But the fact is, the impact of using the atomic operation looks like 6x.
As stated before, a lock is approximately twice the cost of the atomic operation and that result is roughly observed (along with some further overhead). Each version is then modified to spawn a second thread that does the same work as the first. All of my results are cycles per increment and follow (taken from a single run, reproducibility not guaranteed):
TEST                   |      -O3      |      -O0
Single Thread          |   1 cyc / inc |   7 cyc / inc
Single Thread volatile |   6 cyc / inc |  16 cyc / inc
Single Thread w/ CAS   |  38 cyc / inc |  46 cyc / inc
Single Thread w/ lock  |  90 cyc / inc |  92 cyc / inc
Two Threads w/ CAS     | 250 cyc / inc | 251 cyc / inc
Two Threads w/ lock    | 730 cyc / inc | 630 cyc / inc
What can we conclude? First, adding -O3 (optimizing) had almost no effect on the results once the synchronization mechanisms are introduced. Yes, unoptimized code is bad, but it doesn't matter when all of your work is synchronization and the effects that it imposes on the processor's performance tricks.
Second, why did the unoptimized lock version do better? Because there was a different thread interleaving. Perhaps the OS put one of the threads to sleep and the other thread ran happily for a while. The point is that if you are synchronizing, then that is your overhead.
Third, and here is the dangerous knowledge, you might conclude that you want to use CAS instead of the lock. Yet this test was highly artificial: the synchronized operation took just a couple of cycles when optimized, so we could observe the overhead of synchronization itself. Suppose your work is instead 1250 cycles; now using CAS is only roughly a 33% improvement (1250 + 250 versus 1250 + 730) and not a 3x improvement (250 versus 730). Furthermore, how often is your synchronization a single variable?
To conclude, IF what you want to synchronize is a single variable, then using CAS can work. Have I done this? Yes. Do I normally? No. I have this tool in my programming toolbox, for those times that I really need it.
Anyway, why not use atomic increment instead? Because we want to ensure that the final value is exactly MAX, no more and no less.
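To make that last point concrete, here is a sketch (my own illustration, not code from the experiment above) of why a plain atomic increment can overshoot while the CAS loop cannot:
extern unsigned long long counter;   /* the same global counter and MAX as above */

/* Run by each of two threads (pthread harness omitted). */
void* worker(void* arg)
{
    /* Plain atomic increment: both threads can pass the check while
       counter == MAX - 1, then both increments land, so the final
       value can end up greater than MAX. */
    while (counter < MAX)
        __sync_fetch_and_add(&counter, 1);

    /* The CAS loop above never overshoots: the new value is computed
       from the old value that passed the check, and the swap succeeds
       only if counter still holds that old value. */
    return arg;
}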
Wednesday, March 27, 2013
Transactional Memory and Intel's TSX
It was some years ago that I sat in the audience and heard AMD present a sketch for how transactional memory (TM) would be implemented in the x64 ISA. More recently, a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc. I'm always a fan of properly implemented locks, as poorly implemented ones so often limit performance and scalability. So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.
A programmer can delineate the transactional section with XBEGIN and XEND instructions. Within the transactional section, all reads and writes are added to a read-set or a write-set accordingly. The granularity for tracking is a cache line. If another processor reads or writes a line in the write-set, or writes a line in the read-set, then the transaction aborts.
Transactions can be semi-nested: a nested transaction can only commit when the outermost transaction is complete, and internally nested transactions do not commit on XEND. If any transaction in the nest aborts, then the entire transaction aborts. When the number of XENDs matches the number of XBEGINs, the entire transaction commits and becomes globally visible. Transactions can be explicitly aborted with the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.
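In practice, compilers expose these instructions through intrinsics rather than raw assembly. Here is a minimal sketch using GCC/Clang's RTM intrinsics (compiled with -mrtm); the fallback-lock pattern and the variable names are my own illustration, not from this post:
#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */
#include <pthread.h>

static long shared_value;
static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;

void increment(void)
{
    unsigned status = _xbegin();          /* XBEGIN */
    if (status == _XBEGIN_STARTED) {
        shared_value++;                   /* tracked in the write-set */
        _xend();                          /* XEND: commit, becomes globally visible */
    } else {
        /* Aborted (conflict, capacity, or explicit XABORT): fall back to a lock.
           A full lock-elision scheme would also read the fallback lock inside
           the transaction and abort if it is held. */
        pthread_mutex_lock(&fallback);
        shared_value++;
        pthread_mutex_unlock(&fallback);
    }
}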
As I understand it, TSX is being built on top of the existing cache coherence mechanisms. Each cache line gains an additional bit to indicate if it is part of a transaction. Each memory operation is treated normally between the processor and the coherence hierarchy with several caveats. If a dirty, transactional block is evicted, then the transaction fails. If a dirty, transactional block is demand downgraded from modified to shared or invalid, then the transaction fails. In this case, a new message would indicate that the request to forward the data fails and the request should be satisfied by memory.
If the transaction commits, then the transactional bits are cleared on each cache line. And the lines operate normally according to existing cache coherence mechanisms.
Wrapping this up, TSX is an ISA extension that almost every program can take advantage of, and it therefore has appeal for personal testing, just like building my own x64 machine back in 2005.
Friday, March 2, 2012
Usage of NULL
I read an argument recently about whether a test for NULL should be:
if (ptr)
- or -
if (ptr != NULL)
Do you know that the assembly for these two sequences is the same? So the first has not optimized the running time of the program. But it does save 8 characters, which mattered long ago; now we have plenty of disk space. So always write the test for NULL by actually testing against NULL. Or I'll come take off style points the next time I'm grading. (If you are writing one-use code that no one will see, then go ahead.)
Separately, the operating system provides explicit support for NULL. It sets up every process's virtual memory so that virtual addresses 0x0 - 0x3ff (and possibly more) have no mapping, so any access through NULL will never be valid.
Tuesday, February 14, 2012
Bit Packing: A World Gone Mad
Let's start with a mythical program. It has some classes (structs, objects, etc) that are instantiated. Each field has a meaning given to it by the code. Many fields can hold far more data than is required of them. In our mythical program, these unused bits are of no concern, as memory is plentiful. But then the mythical program is run on a real machine and bad things happen. Maybe it runs slowly, or crashes when malloc() returns NULL. Now what?!
In this post, I'm going to discuss some tricks and ideas for reducing the memory usage of a program. Before trying these steps, you should understand the program and its workload using profilers, etc., and the program should be verified to be free of memory leaks.
First example. A program was running out of memory. The program went through a memory trace and counted where each memory request went. A count existed for every N locations of memory, and each count was an unsigned 64-bit integer (uint64_t). With 1 billion memory operations, the worst case is a single count of 1 billion (roughly 2^30). With a fairly uniform distribution, each of a billion counters would hold 10, perhaps 100. Switching to a 32-bit integer reduced the program's footprint by 4GB, allowing work to continue. Other proposals included using a two-level counter scheme, where overflow of a small counter (16-bit) results in using a full-sized (64-bit) counter, or tracking only a subset of the counters via random sampling.
The lesson learned is that having some spare bits is acceptable, but counting to, say, "100" when the location can go to 1.84467441 × 10^19 might be excessive.
Second example. I've probably written about this before, and I'll likely do so again. Locks are often wastes of space. Consider the following:
atomic_add(int* loc, int val)
{
    lock();
    *loc += val;
    unlock();
}
Looks fine, right? Consider that the lock is 4 bytes of space, of which 1 bit indicates whether it is acquired. Two compactions are possible. First, the lock could be integrated into the value itself; this would reduce the range of the value, as one bit is now dedicated to holding the lock state. Or second, many architectures (like x86) support the above operation as a single instruction, so using that instruction removes the need for a lock bit / word entirely.
Why don't we always replace the locks with atomic instructions? Two reasons: first, if the locked section is ever more complex, then the atomic sequence becomes vastly more complex, as only the simplest of operations (add, subtract, or, and, etc) are supported with atomicity. Second, the determination of the appropriate transformations in the compiler is exceedingly complex and may not be solvable in the general case. Since the compiler wouldn't be able to do this generally, it looks instead to the programmer to introduce this change.
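For the simple case above, the single-instruction replacement looks like the following sketch (using the same GCC __sync builtins as the earlier post; on x86 this compiles down to a lock xadd):
/* No separate lock word needed: the add itself is atomic. */
void atomic_add(int* loc, int val)
{
    __sync_fetch_and_add(loc, val);
}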
Third example. Storing lots of pointers to long lists of numbers (these could be strings, or perhaps lists of edges in a graph). On a 64-bit system, each pointer is a 64-bit number, or is it? Current x86-64 architectures are limited to 48 bits of virtual address, to reduce the overhead of tracking the virtual memory space. So every pointer gives us 48 bits of information and 16 bits of zeros. (Many pointers are also aligned, which zeroes some of the low-order bits.) And since the programmer knows those values at compile time, they are effectively constants!
A common operation is to ask for the length of a list. Ideally, this value is stored. But where? If the leading pointer has 16 available bits, then the length could be made part of the pointer. But what if the length could be greater than 65536 (2^16)? A hybrid floating point design could be used.
[I/F:1][LENGTH:15][ADDR:48]
The first bit indicates whether an integer or a floating-point length is stored. If it is an integer, the next 15 bits store the length directly. If it is a float, then those 15 bits split into an exponent and a fractional component: 6 exponent bits cover lengths up to a 64-bit number (note that the float form should only be used for lengths greater than 2^15, since smaller lengths have a precise integer representation), or 4 exponent bits would cover roughly 2^16 through 2^31. The remainder of the 15 bits holds the fraction.
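A minimal sketch of the integer case follows (the masks, helper names, and the assumption of zeroed upper bits in user-space addresses are mine, for illustration):
#include <stdint.h>

#define ADDR_MASK  0x0000FFFFFFFFFFFFULL   /* low 48 bits: the pointer */
#define LEN_SHIFT  48                       /* next 15 bits: the length */

/* Pack a user-space pointer and a length < 2^15 into one 64-bit word.
   Bit 63 (the I/F flag) is left 0 to indicate an integer length. */
static inline uint64_t pack(const void* p, uint16_t len)
{
    return ((uint64_t)(uintptr_t)p & ADDR_MASK)
         | ((uint64_t)(len & 0x7FFF) << LEN_SHIFT);
}

static inline void* unpack_ptr(uint64_t packed)
{
    return (void*)(uintptr_t)(packed & ADDR_MASK);
}

static inline uint16_t unpack_len(uint64_t packed)
{
    return (uint16_t)((packed >> LEN_SHIFT) & 0x7FFF);
}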
Most programs will not need bit packing. Every time it is packed, there is the cost of extra instructions. But you will sometimes just need to save the space, even if it means making the design more complex.
Thursday, December 30, 2010
A Pointer on Pointers Part 2
In this second part of the series on pointers, we'll cover what pointers point to and how that relates to reality. Let's consider four pointers: void*, char[], struct*, and void* (*)(int) (i.e., a function pointer). With each pointer, we'll learn something further about C, assembly, and how a processor interacts with memory.
To begin, there is the all-purpose void* pointer (and yes, void* means pointer to void, but I'll keep writing "void* pointer" for emphasis). It is a C type that means a pointer to nothing, or anything. This pointer is important for allocating space and for casting (changing the type of) that space. The casting piece only has representation in the programming language; in assembly, a pointer carries no type information. So by casting, the programmer tells the compiler that this block of memory is now to be treated differently. Therefore, void* pointers are used when the type is not known (e.g., pthread_create(..., void* arg) or CreateThread(..., LPVOID lpParameter, ...)) or to discard existing type information.
A char[] is the next C type to discuss here. Every array in C decays to a pointer. Learn this fact. Internalize it. They are so nearly identical that you can use them interchangeably, like so:
char* foo = (char*) malloc(1024 * sizeof(char));
foo[0] = '\0';       // 1
*(foo + 0) = '\0';   // 2
Lines 1 and 2 are equivalent. So a pointer to many chars is an array of chars, and we can access any array offset with the pointer, or use pointer arithmetic to access specific elements. Now, the next time you see char*, you can think of it as an array of characters. (In a later post, we'll cover char** and other more complex pointers.)
Working with the third type, a struct* pointer. Structs are merely logical arrangements of data that a programmer specifies. So accessing this data via a pointer will set up assembly to be at some offset from the base pointer. If you want to ignore this compiler support, such things are your purview. And in seeing how you can do this, we'll learn what the compiler is doing and what a struct* pointer is. We want a struct with two shorts, followed by a long. And we'll assign 'A', 'B', and 1024 to the three fields respectively.
typedef struct _pop2 {
    short a, b;
    long c;
} POP2, *PPOP2;

// Option 1
PPOP2 structP = (PPOP2) malloc(sizeof(POP2));
structP->a = 'A';
structP->b = 'B';
structP->c = 1024;

// Or option 2 (assumes a 32-bit ABI: 2-byte shorts, a 4-byte long, and no padding)
char* no_struct = (char*) malloc(8);
*(short*)(no_struct + 0) = 'A';
*(short*)(no_struct + 2) = 'B';
*(long*)(no_struct + 4) = 1024;
You might be scratching your head, but options 1 and 2 do the same thing. Option 1 is how a programmer should write the code. Option 2 is roughly what the computer will actually be executing. So just remember that structs and struct* pointers are just logical arrangements of the data for your convenience.
Lastly, the function pointer cannot be ignored. A vital tool in modern architectures is the ability to call functions. Most calls use built-in addresses (established at compile time), but sometimes where the code wants to go isn't known until runtime. Function pointers are what enable inheritance in OO development and dynamic linking, and they often find use in switch statements and other constructs. (They are especially fun to use with runtime code generation, but that's another post.) At their core, a programmer is merely telling the computer that execution should continue at some address.
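A small sketch of the idea (the add / sub functions are hypothetical stand-ins for any call target chosen at runtime):
#include <stdio.h>

int add(int a, int b) { return a + b; }
int sub(int a, int b) { return a - b; }

int main(void)
{
    int (*op)(int, int) = add;   // the target address is stored in a variable
    printf("%d\n", op(2, 3));    // calls add: prints 5
    op = sub;
    printf("%d\n", op(2, 3));    // calls sub: prints -1
    return 0;
}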
To summarize, memory is just sequences of bytes that are given meaning by the types and usage that exist in the source code and consequently the program itself. To be different types of pointers is to look at memory through different lenses.
Tuesday, October 26, 2010
A Pointer on Pointers Part 1
People often express uncertainty about the working of pointers, structures, etc. They may not be my readers, but hearing their plight moves me to write nonetheless. Others have noticed this too, again looking at StackOverflow. Everything in this post is written from a C-centric point of view, unless otherwise noted.
There are two separate items to understand about pointers. First, where they come from. And second, what they point to. We will roughly be working off of the following picture of memory usage, but understand that this is so simplified that it is almost wrong.
Memory can be allocated from two places: the stack and the heap. Both are generalizations of the actual mechanics. Stack allocations are explicitly named variables in the source, declared either globally or local to a function. Global variables use space delineated in the binary image, rather than the stack pages, but a programmer should consider them identically in that the allocations are only good within their scope. This leads to my first example:
struct foo* newFoo()
{
    struct foo F;
    return &F;
}
Managed language programmers often attempt the above example. newFoo() creates a foo and returns a reference; however, the referenced foo is on the stack. In case there is uncertainty, let's see the assembly for function newFoo():
; Using 32bit x86 assembly, as 64bit has calling convention differences
push ebp
mov ebp, esp
sub esp, 0x10 ; Allocate 16 bytes for a "foo"
mov eax, esp ; Put the address into the return value
mov esp, ebp ; Cleanup
pop ebp
ret
esp points to the top of the stack, so the subtract instruction changes the location pointed to, thereby setting aside the required space for a foo object on the stack; but the space is only available while in the function. When the cleanup section is reached, the space is "reclaimed" as esp is changed back. (The funny thing is, an optimizing compiler may inline newFoo() and therefore change the scope of this stack allocation.) So to make an allocation persist beyond its allocating scope, we need the "heap".
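For contrast, a heap-allocating version of newFoo (a sketch; the placeholder definition of struct foo is mine) would look like this:
#include <stdlib.h>

struct foo { int x; };   // placeholder definition for the sketch

struct foo* newFoo(void)
{
    struct foo* f = (struct foo*) malloc(sizeof(struct foo));
    return f;   // valid after return; the caller must eventually free() it
}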
Heap memory is memory that won't be reclaimed until the program explicitly requests it (or terminates). While malloc() is the most common method for heap allocation, it is not the only one. From these methods, a section of memory is set aside for the program's use. To again simplify our picture of computing, a Java implementation of newFoo would disassemble to something like the following:
push ebp
mov ebp, esp
sub esp, 0x4    ; Space for arguments
mov [esp], 0x10 ; "Push argument"
call malloc
mov esp, ebp    ; Cleanup
pop ebp
ret
A language like Java might prohibit making a stack allocation of a foo object and therefore force the allocation to be made from the heap. It does so by not using "sub esp, 0x10" to make the allocation, but instead calling out to malloc() to set aside the necessary space. Now, outside of this function, the address returned is not reclaimed.
An entire series of courses would be required to fully explain the trade-offs and reasoning, and since the length is growing long, I will conclude here and discuss the usage of this memory in a later post.