Thursday, December 29, 2016

Book Review: The Art of Readable Code

I really enjoyed reading The Art of Readable Code.  I always enjoy reading books on style, covering both English writing and source code.  The earlier chapters were the greater highlight for me, as they focused on the basics of style and were demonstrated through languages with which I am familiar.  Some of the later chapters instead gave examples particular to languages that I do not use, such as illustrating style pitfalls in JavaScript.  The book also had value in showing several approaches to the same code and continuing to refactor it toward better style and readability.

While the topic is style, the book is really about fundamentally good practices, which may be implemented in any of several ways (e.g., where to put braces, whether to use camel case, etc.); a particular set of choices is what we term a style.  Between this and the examples within the text, I want to start requiring it of students.  I want them to read it and see why style actually matters.  Or maybe we will just have to wait until they experience why it matters and suffer for it.

Thursday, December 8, 2016

Practical TSX

I previously speculated about how Intel's TSX is implemented; however, I did not have access to any machines supporting TSX until this year.  I still have not done much testing personally, but I did direct two students who explored and measured this support.  As a reminder, TSX is an implementation of hardware transactional memory, which can simplify concurrency designs by avoiding the need for locks.  Being a hardware implementation, it has certain fixed costs and constraints.

Mario Dehesa-Azuara and Nick Stanley completed a 6-week project in the spring, and the summary below is taken from their final project report.  As they are students, their report may not be available indefinitely, so this link may be broken at some future date.

First, they reviewed the best practices for writing TSX-enabled code.  In particular, the TSX path and the fallback path (the fallback path is required to ensure that the code can make progress even with aborts) must be mutually exclusive.  This can require additional operations versus a pure, transactional approach.
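
As a concrete illustration, here is a minimal sketch of that mutual-exclusion pattern using the RTM intrinsics from immintrin.h (compile with gcc -mrtm).  The fallback_mutex and fallback_held names are hypothetical, and production code would typically retry the transaction a few times before taking the lock:

#include <immintrin.h>  /* RTM intrinsics: _xbegin, _xend, _xabort */
#include <pthread.h>

/* Hypothetical fallback lock protecting the same data as the transaction */
static pthread_mutex_t fallback_mutex = PTHREAD_MUTEX_INITIALIZER;
static volatile int fallback_held = 0;

void update(long* a, long* b)
{
    if (_xbegin() == _XBEGIN_STARTED)
    {
        // Reading the flag adds it to the transaction's read set, so a
        // writer on the fallback path will abort this transaction
        if (fallback_held) _xabort(0xff);
        *a += 1;
        *b += 1;
        _xend();
        return;
    }
    // Fallback path: guarantees progress and excludes the TSX path
    pthread_mutex_lock(&fallback_mutex);
    fallback_held = 1;
    *a += 1;
    *b += 1;
    fallback_held = 0;
    pthread_mutex_unlock(&fallback_mutex);
}

The extra check of fallback_held is exactly the kind of additional operation that a pure transactional approach would not need.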

Second, they measured the cost of concurrent data structure updates.  Their work noted that writing a transactional implementation was significantly easier than a fine-grained locking approach.  However, their measurements revealed some counter-intuitive results.  For example, an AVL tree is a self-balancing data structure.  The self-balanced nature is a benefit in that fewer memory accesses should be required.  Yet, the rotations required to maintain this condition actually increased the set of locations accessed and therefore resulted in a high rate of aborts.

To understand this, we must turn to the actual implementation.  We know that TSX can only track a limited number of memory locations (at most, the size of the L1 data cache).  As soon as any transactional memory location (i.e., cache line) cannot be stored in the L1, the transaction must abort.  Thus, limiting the size of a transaction's read and write sets is vital to completing transactions.  In Mario's and Nick's experiments, they observed that after 5 million insertions into an AVL tree, transactions were at a 50% failure rate (regardless of the tested number of threads).  In contrast, a treap, with its probabilistic balancing, has relatively constant failure rates that depend on the number of threads (and not the total insertions).

Third, using TSX has an inherent cost that is significantly higher than other atomic operations.  The standing advice remains that simple atomic updates should use the appropriate atomic instructions.  What if you need to perform several of these operations?  Again, we turn to the final report.  The measurements show that simple transactional operations on consecutive memory locations will be faster than the equivalent atomic operations on those locations once you access at least 6 locations per "transaction".  Nonetheless, if the program must obey another constraint, such as updating all or none of the elements, then locks or transactions are required.
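
To make that concrete, here is a sketch, under the same assumptions as the earlier one, contrasting several independent atomic updates with a single transaction that covers all of them (counts, N, and the elided fallback are illustrative):

#include <immintrin.h>

#define N 8
static long counts[N];

// N separate atomic increments: each is individually atomic,
// but the set of N updates is not all-or-nothing
void incrementAtomically(void)
{
    int i;
    for (i = 0; i < N; i++)
        __atomic_fetch_add(&counts[i], 1, __ATOMIC_RELAXED);
}

// One transaction: all N updates become visible together, or none do
int incrementTransactionally(void)
{
    if (_xbegin() == _XBEGIN_STARTED)
    {
        int i;
        for (i = 0; i < N; i++) counts[i]++;
        _xend();
        return 1;
    }
    return 0;  // Aborted; the caller must retry or take a fallback lock
}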

It is important to remember that a major benefit of transactional memory is in design and implementation effort, not in the actual runtime of the program.

Friday, October 14, 2016

Conference Attendance Teaching and Learning Summit 2016 - Keynote

Critical Thinking: Why is it so hard to teach? - Dr Daniel T. Willingham

Critical thinking is intertwined with content knowledge.  We saw a sequence of four examples (if vowel, then even number; if alcohol, then 21; if gin, then haddock; if entering, then cholera vaccine).  For each example, there is a claim about a set of cards, "if X then Y", and given four cards, one must verify the claim.  When the problems were formulated in terms of permissions, the success rate was high.  Each problem is technically P -> Q, yet having just completed a semester of logic has no impact on the results.

Scientific reasoning is taught in two pieces: scientific concepts and the scientific method.  So consider designing a learning experiment.  The group is split into intervention and control.  How do you know that the random sample is valid?  Background knowledge is required to determine the appropriateness of the split.

Critical thinking occurs from learning at the deep level.  The surface story is, say, "tumors and rays".  The deep question is whether it is modus ponens, Newton's third law, etc.  However, memory is focused on the surface facts, and recall is based on those components.

Why not teach the deep structure immediately?  Abstractions are hard to understand.  Instead, learners have to see lots of surface structures all overlaying the same deep structure.

Sometimes failures in critical thinking are actually failures in basic knowledge.  Furthermore, there are also innate biases, such as words refer to objects and attributes, and the world is full of agents and purposes.

Takeaway 1: Most of critical thinking is domain-specific.
Takeaway 2: In each domain, faculty should identify what they consider the important critical thinking skills.
Takeaway 3: Select content with an eye toward teaching these skills.  Teach the critical thinking in the context of the content.
Takeaway 4: Critical thinking is a curricular issue.  These skills require more than 1 semester to acquire.
Takeaway 5: Certain foundational concepts may run counter to the mind's biases.  Students have functional knowledge that has worked so far.  For example, "equals sign means put answer here".

Q. Translating domain skills in interdisciplinary work?
A. Don't know.  Probably one needs to know enough of the skills in the home domain to be able to explore the other domain.

Q. If critical thinking is domain specific, how specific are domains?
A. Domains are nested.  Proper application requires domain knowledge.  Moving from cognitive psychology to social psychology leaves [the speaker] less skilled, but still better than average.  Into clinical psychology, the fields share a common basis, but the ability to apply it is limited.

Tuesday, September 6, 2016

When is it signaled?

Signals are a mechanism for notifying a process of a simple event.  Most programmers and programs can ignore them, as the default behaviors are reasonable, and taking steps to handle them would greatly increase the program's complexity.  But when teaching future computer scientists, we want programmers to know about these mechanisms and thereby properly understand the functioning of the system.

In working with signals, developing programmers are often exposed to their first dose of concurrency: the idea that execution can happen in simultaneous, arbitrary orders except where the program takes explicit action.  With signals, a program can do several things (a minimal sketch follows the list):

  • Provide and install a handler for one or more signals
  • Block the receipt of one or more signals
  • Send a signal to one or more processes
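
Here is a minimal sketch of all three operations in one program; for simplicity, it signals itself with SIGUSR1, and the names (onUsr1, seen) are purely illustrative:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t seen = 0;

static void onUsr1(int sig) { seen = 1; }

int main(void)
{
    // 1. Provide and install a handler for SIGUSR1
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = onUsr1;
    sigaction(SIGUSR1, &sa, NULL);

    // 2. Block the receipt of SIGUSR1
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, NULL);

    // 3. Send the signal (here, to ourselves); it stays pending while blocked
    kill(getpid(), SIGUSR1);

    // Unblocking delivers the pending signal, which runs the handler
    sigprocmask(SIG_UNBLOCK, &set, NULL);
    printf("handler ran: %d\n", (int)seen);
    return 0;
}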

We are going to consider the problem of sending and receiving a signal.  After a signal is sent, when does the other process receive it?  Students assume that when the sending function (i.e., kill) has returned, the signal has been sent *and received*.  However, I have found no text that explicitly guarantees this condition, so I prepared a simple program (source at the end of this post) to test it.

As signals are communicated through the operating system, we want a different mechanism for measuring simultaneity, in this case shared memory.  The experiment program will create and set up a small space of shared memory between two processes.  Next, it will wait until both processes are running in a known state (i.e., a barrier).  Then one (the parent) will signal the other (the child), while measuring how long it takes to send, as well as to receive, the signal.  Finally, it runs this experiment a million times.

On my current Ubuntu box with Skylake processors, the experimental measurements show that 80% of the time, the child lasts for an average of 40 cycles after kill returns.  The maximum observed time is almost 1200 cycles.  Assuming the skew between the cores' clocks is small, this effectively means that the child can continue to run even after kill has returned.

Source code follows: (compiled with gcc version 4.8.3, with gcc -O3 -lrt)
#include <sys/mman.h>
#include <sys/types.h>

#include <sys/wait.h>
#include <sys/stat.h> /* For mode constants */
#include <fcntl.h> /* For O_* constants */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdint.h>

#include "rdtsc.h"

static const int ITER = 1000 * 1000;

int main(int argc, char** argv)
{
    int shm_fd = shm_open("/squig", O_CREAT | O_RDWR, (S_IRUSR | S_IWUSR));  // POSIX shm names should begin with '/'
    int count = 0, i;
    uint64_t zero = 0, min = ~0x0, max = 0, sum = 0;
    uint64_t minN = ~0x0, maxN = 0, sumN = 0;
    write(shm_fd, &zero, 8);  // Give the shared file "space"
    void* msh = mmap(NULL, 4 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    volatile uint64_t* tsc = msh;  // Shared timestamp: the child writes it, the parent reads it
    
    for (i = 0; i < ITER; i++)
    {
        *tsc = 0;
        pid_t chd = fork();
        
        if (chd == 0)
        {
            // Give the compiler something to think about
            while(zero < ~0x0) { *tsc = rdtsc(); zero++;}
        }
        else
        {
            // Wait for the child
            while (*tsc == 0) {}
            
            uint64_t st = *tsc; // When is it for the child?
            kill(chd, SIGINT); // Send SIGINT to child
            uint64_t k = rdtsc(); // Returned from kill
            wait(NULL); // Reap
            uint64_t delta = 0;
            uint64_t en = *tsc; // When did the child end?
            
            // k > en implies that kill returned after the child terminated
            if (k > en)
            {
                count ++;
                delta = k - en;
                if (delta < minN) minN = delta;
                if (delta > maxN) maxN = delta;
                sumN += delta;
            }
            else
            {
                delta = en - k;
                if (delta < min) min = delta;
                if (delta > max) max = delta;
                sum += delta;
            }
        }
    }
    
    printf("Min: %lx, Max: %lx, Avg: %lx\n", min, max, (sum / (ITER - count)));
    printf("Min: %lx, Max: %lx, Avg: %lx\n", minN, maxN, (sumN / (count)));
    printf("Percent Parent After: %lf\n", (count / (double)ITER));
    
    munmap(msh, 4 * 1024);
    shm_unlink("/squig");  // Clean up the shared memory object
    return 0;
}

Update: Results also hold when sending SIGKILL.


Thursday, June 30, 2016

Computer Scientists and Computer Usage

This post is built on a discussion I had with Dr. Thomas Benson and the recent blog post by Professor Janet Davis, I've got a POSSE.  Less directly, Professor Mark Guzdial has written several recent blog posts about parents wanting CS in high school, most recently this (where the point is raised that not everyone knows what this is).

What is it to be a computer scientist?  Is it just software development?  When I was first looking at colleges and I knew I wanted to program, I saw three majors that seemed appropriate: Computer Science, computer programming, and game development.  And from my uninformed perspective, Computer Science (CS) initially seemed the least applicable.  This was before Wikipedia, so how would one know what CS is?

Michael Hewner's PhD defense was a study of how undergrads perceive the field: as they learn and progress through the curriculum, do their views change?  I know my views changed; for example, functional programming was a complete unknown to me before matriculating.  By the completion of my undergraduate degree, I perceived that Computer Science has three pillars: systems, theory, and application.  I still view CS primarily through a programmer's lens and not a big-tent view.

Indirectly, the Economist wrote about Programming Boot Camps, where college grads go back to get training in programming; however, "Critics also argue that no crash course can compare with a computer-science degree. They contend that three months’ study of algorithms and data structures is barely enough to get an entry-level job."  I both agree and disagree.  There is a further transformation of the economy coming, whereby workers will increasingly need some amount of programming skill.  One PL (programming language) researcher recently estimated that Excel is the most common programming language.  Can the worker using Excel do more, or do it faster, with VB or Python or ...?

Are there nuances to the study of Computer Science in which students can focus further?  Clearly there are, given the mere presence of electives.  Georgia Tech even groups similar electives into "threads".  Beyond electives, under a programmer-centric view of CS, software development involves more than just the strict programming aspect.  Even if CS is the "programming", a written program requires software engineering to maintain and manage its development, and designers to prepare the UI (etc).

Thus CS might evolve more toward having a pre-CS major, and after the first two(?) years, students can then declare as Computer Science or Software Engineering or Human Computer Interaction or ....  This is both less and more than Georgia Tech's threads, but an approach with similarities.  In common, this development of CS major(s) balances the concern that students have the fundamentals to succeed in areas related to their actual major (i.e., programming, software development, and design) with allowing greater depth and specialization.

This step is a maturing of the discipline.  No longer just one major, but a school of majors.  CMU offers a BS in CS, along with three other interdisciplinary majors.  Georgia Tech offers CS (with 8-choose-2 threads) and Computational Media.  Rose-Hulman offers both CS and Software Engineering.  (I am only citing the programs that I know about, not an exhaustive list.)

Friday, May 6, 2016

Tail Duplication with LLVM

Let me start by saying that tail duplication exists in LLVM, but only as a backend, machine code optimization.  I am going to detail several engineering points that were required to support this operation strictly in the LLVM intermediate representation (IR).

To demonstrate tail duplication, I will take an extremely simple example of returning whether a number is even.  (This is just a demonstration; an optimizing compiler can reduce this example to two inlined x86 instructions.)  In the following code, there are four basic blocks: A, B, C, and D.  Block A initializes the value and executes the conditional instruction with control flow to both blocks B and C.  Blocks B and C update the value and then fall through to block D, which returns the value updated in either B or C.  This control flow is the classic diamond pattern, named for the picture it makes in the control flow graph (CFG).

bool isEven(int x)
{
    bool returnValue = false;    // A
    if (x % 2 == 0) 
        returnValue = true;      // B
    else 
        returnValue = false;     // C
    return returnValue;          // D
}

The idea is to duplicate a tail block and then merge each duplicate into its predecessor.  This (often) increases the code size, but provides a larger block of instructions for optimization and scheduling and is a vital step in superblock scheduling.  In the above example, tail duplication would duplicate block D into D' and then merge D with B and D' with C.

The second thing that you should know about is SSA form.  This is a way of representing the instructions such that every value comes from a single, static assignment (SSA).  If every variable is only assigned to once, how do loops work?  How did the above example work?  A special assignment is added, called a PHI node.  A PHI node selects from multiple values depending on which edge in the CFG was followed.  SSA provides an ease of understanding where a value is created and every place that it is used.

Tail duplication follows two basic steps that LLVM provides support to accomplish:
  1. Clone the basic block
  2. Merge cloned blocks with the predecessors
Simple.

Let's start by calling CloneBasicBlock.  The first problem is that the cloned block is a perfect clone, in that it uses all of the old instructions' values.  All of them.  It is like you move across the country and discover that your lamp is still plugged in at your old place, somehow.  Tackling this problem requires the code to fix all of these use-def chains, so that the new uses depend on the new definitions in the cloned block.  Another function, CloneFunctionInto, has a loop that fixes all of the instructions in the cloned function, which will serve here too.

And then the second problem with cloning basic blocks is that the program has introduced new definitions into the program, which may now vie with the existing definitions.  There is a simple solution to merging these values, PHI nodes.  Of course, these PHI nodes have nothing to do with the old PHI nodes already present.  I spent a long time thinking about how I would update the existing PHIs, until I realized that they were all going to be cloned and merged, so I would need a completely new set, as the one definition had become many and the many uses had become one.

Next each basic block is to be merged with its one predecessor.  LLVM provides MergeBlockIntoPredecessor; however, the IR kept failing in odd ways after using this function.  What I discovered is that the basic block was no longer well formed.  The cloned block has only one predecessor, but its PHINodes still have multiple predecessors.  Therefore, the merge step would take the first value for each PHINode (which would be correct if the block was well formed), but in my case I was in the process of reforming the IR.  Therefore, I had to fix the PHINodes to only have one incoming value.

I thought what I would do is iterate through the incoming paths into each PHI instruction and delete any that do not match the single predecessor.  Unfortunately, when an incoming value is deleted, the iterators are invalidated (not an uncommon problem when iterating).  So in the end, the tail duplication I was writing had to do what MergeBlockIntoPredecessor already does: select one value for each PHINode and update the uses accordingly.
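
Putting the pieces together, a sketch of the whole flow might look like the following.  The API names come from recent LLVM releases (exact signatures vary across versions), and duplicateTailInto is my illustrative name, not an LLVM function.  It handles one predecessor; the pass would call it for each predecessor and then erase the original tail block.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"  // MergeBlockIntoPredecessor
#include "llvm/Transforms/Utils/Cloning.h"          // CloneBasicBlock
#include "llvm/Transforms/Utils/ValueMapper.h"      // RemapInstruction
using namespace llvm;

static void duplicateTailInto(BasicBlock *Pred, BasicBlock *D)
{
    // Step 1: clone the block, then fix the use-def chains so the clone's
    // instructions use the cloned definitions rather than the originals
    ValueToValueMapTy VMap;
    BasicBlock *Clone = CloneBasicBlock(D, VMap, ".dup", D->getParent());
    for (Instruction &I : *Clone)
        RemapInstruction(&I, VMap, RF_IgnoreMissingLocals);

    // Retarget Pred's terminator from D to the clone
    Pred->getTerminator()->replaceSuccessorWith(D, Clone);

    // The clone has a single predecessor, but its PHI nodes still list
    // every incoming edge of D; collapse each to the one live value
    while (PHINode *PN = dyn_cast<PHINode>(&Clone->front()))
    {
        PN->replaceAllUsesWith(PN->getIncomingValueForBlock(Pred));
        PN->eraseFromParent();
    }

    // Step 2: merge the now well-formed clone into its unique predecessor
    MergeBlockIntoPredecessor(Clone);
}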

With that development complete, now I can move on to the interesting question of whether there are cases when doing tail duplication would be a good idea and provide some benefit to my instrumentation.

Monday, March 21, 2016

PhD Defense - Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Subsystems

Vivek Seshadri gave his PhD defense today covering: how different memory resources have different granularities of access, and therefore need different management techniques.  These techniques come out of understanding what hardware could do, without necessarily identifying common features in existing applications that would require / benefit from these techniques.

Page Overlays / Overlay-on-Write: provide the ability to assign physical addresses at sub-page granularities (call them overlays).  This reduces the cost of sparse copy-on-writes.  In effect, assign a fresh sub-page unit and copy to that location.  On every access, check the overlay table in parallel to determine whether to use the normal translation or go to the overlay location.

Gather-Scatter DRAM: provide support for only requesting a subset of cachelines.  First, shuffle the data in a cacheline so that the same subset of multiple cache lines will map to different chips in DRAM.  Second, introduce additional logic (just a few gates) in DRAM that will compute a modified address, where the default pattern (stride 1) is the normal, un-modified access.  

RowClone + BuddyDRAM: can DRAM speed up memcpy (and other bulk memory operations)?  First, by opening one row after another, the bitline will take the first row's value and then write it into the second row.  More complex is opening three rows simultaneously, which results in bit-wise operations across the three rows: final = C (A | B) | ~C (A & B).  By controlling C, bulk bitwise operations are possible.  Using this technique, the system can exceed the memory bandwidth for these operations.
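
To make the algebra concrete, here is a toy simulation of what triple-row activation computes, in ordinary code rather than DRAM; activating with C all ones selects OR, and with C all zeros selects AND:

#include <stdint.h>
#include <stdio.h>

// Triple-row activation leaves each bitline at the majority value:
// final = C(A | B) | ~C(A & B)
static uint8_t tripleRowActivate(uint8_t A, uint8_t B, uint8_t C)
{
    return (C & (A | B)) | (~C & (A & B));
}

int main(void)
{
    uint8_t A = 0xC, B = 0xA;  // 1100 and 1010
    printf("OR:  %x\n", tripleRowActivate(A, B, 0xFF));  // C all ones  -> A | B = e
    printf("AND: %x\n", tripleRowActivate(A, B, 0x00));  // C all zeros -> A & B = 8
    return 0;
}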

DirtyBlock Index: the problem is that if the source is dirty, then it needs to be written back before the previous techniques can be used.  DBI provides a faster lookup mechanism to determine whether and where any dirty blocks exist.

These techniques are interesting, but as the candidate noted, they are in effect solutions in search of a problem.  And with DRAM being commodity hardware, it is difficult to envision these techniques being adopted without further work.

Wednesday, March 9, 2016

Repost: How I spent my time at SIGCSE

When I attend a conference, I try to prepare blog posts detailing the presentations that I see and other items of objective content.  More important are the people in attendance.  I spend more time meeting with colleagues than I do actually sitting in the sessions.  Before you are shocked, understand that much of our conversation is about those sessions.  Depending on the conference, there can be between 2 and 10(?) sessions occurring concurrently.  I cannot be in 2, let alone 10, places at once, so instead we sample (as in the appearance of randomly selecting) the sessions and then discuss.

I met Janet Davis, who now heads the CS program at Whitman College, during my first time at SIGCSE.  I value her different perspective and always try to seek her out at some point during the conference.

These themes merge in her blog post, which gives some idea of how each day is organized and how the sessions (papers, panels, etc.) play a part in a very busy schedule.

Friday, March 4, 2016

Conference Attendance SIGCSE 2016 - Day 2

After lunch when we are all in food comas, let's attend the best paper talk!
A Multi-institutional Study of Peer Instruction in Introductory Computing -
This study followed 7 instructors across different institutions as they used peer instruction.  It showed that the technique is generally recognized as valuable, while also touching on ways in which it can go awry.  Tell students why this technique is being used and what its effect is.  Hard questions are good questions to ask, as students will discuss and learn from them.  This requires that questions be graded for participation and not *correctness*.  Possible questions and material for peer instruction are available.

Development of a Concept Inventory for Computer Science Introductory Programming -
A concept inventory is a set of questions that carefully tease out student misunderstandings and misconceptions.  Take the exams and identify both the learning objective and the misconception that results in incorrect answers.

#include <stdio.h>

int addFiveToNumber(int n)
{
  int c = 0;
  // Insert line here
  return c;
}

int main(int argc, char** argv)
{
  int x = 0;
  x = addFiveToNumber(x);
  printf("%d\n", x);
  return 0;
}

a) scanf("%d", &n);
b) n = n + 5;
c) c = n + 5;
d) x = x + 5;

Each incorrect answer illustrates a different misconception: for example, that input must come from the keyboard, or that variables are passed by reference.
Overall, this study illustrated how the concept inventory was developed, but not the impact of having it or what it revealed about students and their learning.

Uncommon Teaching Languages - (specifically in intro courses)
An interesting effect of using an uncommon language in an introductory course is that novices and experts start with similar skills.  Languages should be chosen to minimize churn; otherwise students feel that they haven't mastered any language.  Related to this point, languages also exist within an institutional ecosystem.  Furthermore, we want to minimize the keywords / concepts required for a simple program.  A novice will adopt these keywords, but they are also "magic" and arcane.  And program length matters, as we want novices to only have to write short code to start.

I also attended the SIGCSE business meeting and then the NCWIT reception.  I have gone to NCWIT every year at SIGCSE, as I want to know what I should do (or not do) to not bias anyone's experience in Computer Science.

Thursday, March 3, 2016

Conference Attendance SIGCSE 2016 - Day 1

Here I am at SIGCSE again.  This is a wonderful opportunity to think and reflect on how I assist students in learning Computer Science and to be Computer Scientists.  And to connect with other faculty, researchers, etc who are interested in teaching and doing so in a quality manner.

An Examination of Layers of Quizzing in Two Computer Systems Courses -
In this work, the instructor taught the Intro Computer Systems course based on Bryant and O'Hallaron's book (paid link).  After several years of teaching, she introduced a new layer of quizzing to the course.  Effectively, before each class, students take a pre-quiz worth ~0% of their grade (20 quizzes combine to 5%), and can then come to class with knowledge of and feedback on their deficiencies.  Since the introduction of the quizzes, students have been doing better in these courses.

Subgoals Help Students Solve Parsons Problems -  (previewed at Mark Guzdial's blog)
When learning new things, students benefit from labeling the subgoals involved in solving problems.  These labels provide a basis for solving similar problems.  There are two different strategies for labeling: students can provide the labels, or the assignment can provide them.  An example labeling can be found with loops: initialize, test, change.  If students provide the labels and those labels generalize across problems, they do best.  If they provide labels that are problem-specific, such as "are there more tips" (with respect to an array of tips), then they do worse than students who were given the labels.  Developing labels can be valuable, but it may require the expert to still provide guidance to help abstract them across problems.  This talk had one of those great moments when someone asked a question and Brianna replied, "So-and-so has done great ...", and the questioner pointed out that he is "so-and-so".

As CS Enrollments Grow, Are We Attracting Weaker Students?: A Statistical Analysis of Student Performance in Introductory Programming Courses Over Time -
In this study, one instructor has analyzed the data of student assignment grades across 7 years of Fall semesters in the CS 1 course.  Several specific and clear criteria were applied to get a clean and comparable data set.  The first test is that the number of student withdrawals remained the same as a percentage of the total class size.  The second test is that the means of the grades for the courses are statistically indistinguishable.  The third test is to use a mixture model (a weighted combination of distributions) for each class's scores.  A good fit is found with two Gaussian distributions, such that there is one for the "good students" and a second for the high-variance students who are "potentially weaker".  From this, the study concluded that (at Stanford, in Fall CS1) there are more "weak students" and more "strong students", as the growing enrollment is drawing from the same larger population.

A (Updated) Review of Empiricism at the SIGCSE Technical Symposium -
Using the proceedings from SIGCSE 14 and 15, they examined the empirical evaluations in the papers and the characteristics of those evaluations.  How was the data collected in each paper?  And what was being evaluated (pedagogy, assignments, tools, etc.)?  Is the subject novel, or does it replicate other studies?  Based on this study, would SIGCSE benefit from a separate track for longer paper submissions?  Or workshops on how to empirically validate results?  This and other material is being developed under an NSF grant and released publicly.

Birds of a Feather -
In the evening, I attended two Birds of a Feather sessions.  Both of which have given me further ideas for what I might do to further (attempt to) improve student learning.  And also possible collaborators toward that end.

Tuesday, January 19, 2016

Conference Attendance HiPEAC - Day 2 - Papers

Another conference day.  Much of my time is spent talking with other attendees and doing "work", such as preparing my presentation, sending emails, etc.  However, I do take some time to actually sit in on other presentations, so here are two highlights:

PARSECs - This work explores rewriting some of the PARSEC benchmarks to use task-based parallelism (OpenMP tasks) rather than pthreads.  For many workloads, these changes provide improved scaling.  For almost all workloads, the code size was reduced, as the original thread pools, job queues, etc. could be removed.  In the near future, these revised versions should be released.

HRF-Relaxed - The original OpenCL had no memory model; however, many vendors implemented one.  Now, C++ and other languages use SC for DRF (sequential consistency for data-race-free programs).  Unfortunately, if you use this consistency model in OpenCL, you will lose performance.  Instead, this work proposes a hierarchical race-free model, where races are only checked at a certain scope of the program.

Monday, January 18, 2016

Conference Attendance HiPEAC - Day 1 - MULTIPROG

It is once again conference time.  For North Americans, this might seem rather early, as I am writing from Prague, Czech Republic (or at least I was when I started 12 hours ago).  I am attending HiPEAC, which is the premier European computer architecture conference.  HiPEAC is a dual-track conference.  Throughout the three days there is the paper track, where papers accepted to TACO (such as mine) are presented.  And simultaneously there are workshops.  For the first day, I am starting with the MULTIPROG workshop, which is on Programmability and Architectures for Heterogeneous Multicores.

Let's start with the keynote, given by David Kaeli of Northeastern University.
- Concurrent execution of compute kernels
- Scheduling of kernels, deadlines
- Sharing / access to host memory (i.e., RAM)

The current model of using a GPGPU is that it runs one computation kernel; however, many problems would decompose better into several separate kernels.  It would also be valuable if there were further examples of these problems (i.e., benchmarks).  Now, whenever you try running multiple anything on a computational resource, there is a runtime scheduling problem: which kernel should run to best complete the overall problem?  A follow-on research question explores this in a cloud-based environment, where the GPU may be shared across entirely independent compute kernels.  This requires the kernels to be tagged with IDs to ensure that their memory is kept separate.  All of this sounds as if we need an OS for the GPU.

Following the late-morning break, we heard next from MECCA (MEeting the Challenges in Computer Architecture) - 3Ps: parallelism, power, and performance.  Consider parallel program annotations that describe the concurrency, and runtime management of caches that uses those annotations: indicating the flow of data, so that data is transferred before it is required and arrives with the appropriate coherence states, and indicating when a block is dead and can be evicted from the cache.

Then there was lunch, resting from my flights, then networking, especially the part where I stood by my poster and discussed my research for 3 hours.  Now to rest for day 2.

Wednesday, January 13, 2016

Repost: Avoid Panicking about Performance

In a recent post, another blogger related how a simple attempt to improve performance nearly spiraled out of control.  The lesson is to always measure and understand your performance problem before attempting any solution.  Now, the very scope of your measurements and understanding can vary depending on the complexity of your solution.  And when your "optimizations" have caused the system to go sideways, it is time to take a careful appraisal of whether to revert or continue.  I have done both, and more often I have wished that I had reverted rather than continued.  After all, it is better for the code to work slowly rather than not work at all.

Again, always measure before cutting.