13.2 - Manipulating Data in a SAS Data Set
In this section, we'll review the many ways in which a SAS data set can be modified, including:
- reducing the number of observations in a data set
- reducing the number of variables in a data set
- creating or modifying a variable
- changing a variable's attributes
- creating an accumulator variable
- assigning values conditionally
While many of the methods will already be familiar to you, some you will see just for the first time.
Reducing the Number of Observations in a Data Set
There may be situations in which you'd like to exclude some of the observations in a SAS data set from an analysis that you are conducting or a report that you are generating. I like to think of doing such a thing as "making a data set shorter." Although there are other ways, we'll investigate two common ways of reducing the number observations in a data set here. We can use an IF-THEN-DELETE statement to exclude observations from the output data set. Or we can use a subsetting IF statement to include observations in our output data set. We will investigate yet another way of reducing the number of observations in our data sets in the next lesson.
Example 13.2. The following program uses an IF-THEN-DELETE statement to exclude golf courses whose par is 70 from the penngolf data set:
The IF-THEN-DELETE statement is pretty straightforward, and therefore probably doesn't deserve any explanation. Launch and run the program, and review the output to convince yourself that golf courses whose par is 70 were excluded from the temporary penngolf data set.
Example 13.3. The following program uses a subsetting IF statement to include only those golf courses whose par is greater than 70 in the penngolf data set:
Again, the subsetting IF statement is pretty straightforward, and therefore probably also doesn't deserve any explanation. Launch and run the program, and review the output to convince yourself that only golf courses whose par is greater than 70 were included in the temporary penngolf data set. You might also want to take note that the resulting data set is the same as the resulting data set from the previous example — just accomplished in a different way!
Reducing the Number of Variables in a Data Set
There may be situations in which you'd like to reduce the number of variables in a data set in order to work with a data set of a much more manageable size. I like to think of doing such a thing as "making a data set thinner." Although we'll investigate how to do this much more extensively in the next lesson, we'll at least introduce the concept by illustrating the use of a DROP statement.
Example 13.4. The following program uses a DROP statement to tell SAS to drop the Architect variable from the temporary output data set penngolf:
Self-explanatory again, eh? In the next lesson, we'll learn that rather than using a DROP statement to exclude variables, we could use a KEEP statement to include variables in the output data set. We'll also discuss the difference between the DROP and KEEP statements and the alternative DROP= and KEEP= options of the DATA and SET statements. For now, let's have you launch and run the program, and review the output to convince yourself that the Architect variable was not included in the output temporary penngolf data set.
Creating or Modifying a Variable
There is definitely nothing new here! Surely, you've created a new variable or two in the myriad SAS programs you have written by now. We include it here only for the sake of completeness, so it can be acknowledged as one way in which a SAS data set can be modified.
Example 13.5. The following program uses the Slope and USGA variables in conjunction with an assignment statement to create a new variable called Bogey:
Launch and run the program, and review the output to convince yourself that the Bogey variable was calculated as suggested by the assignment statement.
Changing a Variable's Attributes
Recall that one of the products of the compile phase is the descriptor portion of a data set containing information about the attributes of each variable in the data set. The attribute information includes the variable's name, type, length, format, informat, and label. As you know, we can use a LENGTH statement to modify the length of a variable, a FORMAT statement to modify the format of a variable, and a LABEL statement to associate a descriptive label to a variable.
Example 13.6. The following program uses 1) a LABEL statement to associate descriptive labels to some of the variables in the penngolf data set and 2) a FORMAT statement to tell SAS how to display two of the variables (Bogey and Yards):
I personally have never seen the need for it, but a label can be as long as 256 characters. Seems a bit like overkill to me. Anyway, launch and run the program, and review the output to convince yourself that the variables were labeled and formatted as suggested.
Creating an Accumulator Variable
Now, this topic is definitely new to us! You might often find yourself in a situation in which you want to create a variable that accumulates the values of another variable. For example, if you have a data set that contains monthly sales, you might want to create a variable that contains year-to-date sales. To create such a variable, you'd most likely want to take advantage of what SAS calls an accumulator variable. To add the contents of a variable, or the result of an expression, to an accumulator variable, we need to use what SAS calls a sum statement.
The general form of a sum statement is:variable + expression;
- variable tells SAS the desired name of your accumulator variable (yeartodate, perhaps). The variable, which must be numeric, is automatically set to 0 before the first observation in the data set is read.
- expression is any valid SAS expression. It might be something as simple as the name of the variable that you want added up. One thing that is important to keep in mind is that if the expression produces a missing value, or a variable's value is missing, the sum statement treats it like a zero. This is in stark contrast to an assignment statement, in which SAS assigns the variable to the left of the equal sign (=) a missing value if any variable on the right side of the equal sign (=) contains a missing value.
In short, the sum statement adds the result of the expression that is on the right side of the plus sign (+) to the numeric accumulator variable that is on the left side of the plus sign. Then, as you can imagine would have to be the case in order to work properly, the accumulator variable's value is retained from one iteration of the DATA step to the next. That is, the accumulator variable is not set to missing as variables usually are when reading in data. Let's take a look at an example!
Example 13.7. The following program uses a sum statement and an accumulator variable called TotalYards to tell SAS to tally up the number of Yards that the golf courses in the penngolf data set have:
Launch and run the program, and review the output. So, let's see here, the value of TotalYards is set to 0 before the first observation is read. Then, as you can see, the value of TotalYards in the first observation becomes 0 plus 7018, or 7018. Then, the value of TotalYards in the second observation becomes 7018 plus 6525, or 13543. Then, the value of TotalYards in the third observation becomes 13543 plus 6392, or 19935. And so on.
Assigning Values Conditionally
You learned how to use IF-THEN-ELSE statements in Stat 480 as a way of assigning values to variables conditionally. Here, we'll explore the SELECT group as an alternative method.
Example 13.8. The following program uses a SELECT group to create a numeric variable called AssnFee that depends on the values of the character variable Type:
As you can see, a SELECT group is comprised of a SELECT statement, a series of required WHEN statements, an optional OTHERWISE statement, and a required END statement. The SELECT statement begins, and the END statement ends, a SELECT group. The easiest way to explain this SELECT group is probably just to compare it to the comparable IF-THEN-ELSE statement:if (Type = "Resort") then AssnFee = 5000;
else if (Type = "Private") then AssnFee = 4000;
else if (Type = "SemiPri") then AssnFee = 2000;
else if (Type = "Public") then AssnFee = 1000;
else AssnFee = .;
That is, in this example, we've told SAS that we want to compare the variable Type to all of the when-conditions appearing in the WHEN statements. If the value of Type for a particular observation matches a when-condition, then the statement that follows that when-condition is executed. If the value of Type for a particular observation does not match any of the when-conditions, SAS executes the OTHERWISE statement.
Launch and run the program, and review the output to convince yourself that the SELECT group functions just like above the IF-THEN-ELSE statement. You might also want to note in this example that we've optionally taken a shortcut of including the variable Type in parentheses in the SELECT statement. This shortcut tells SAS to compare the variable of Type to the values in the parentheses in the WHEN statements. If we did not take this optional shortcut, our SELECT group would have to be written as follows:select;
when (Type = "Resort") AssnFee = 5000;
when (Type = "Private") AssnFee = 4000;
when (Type = "SemiPri") AssnFee = 2000;
when (Type = "Public") AssnFee = 1000;
otherwise AssnFee = .;
You might want to edit and re-run the program to convince yourself that the two SELECT groups are equivalent.
Now you might be wondering when you should use a SELECT group, and when you should use IF-THEN-ELSE statements. When you have a long series of mutually exclusive conditions and the comparison is numeric, using a SELECT group is a bit more efficient than using a series of IF-THEN-ELSE statements. When you need SAS to evaluate just a few conditions (like one or two), you might find it more straightforward to just use an IF-THEN-ELSE statement. You wouldn't lose any efficiency in doing so. That said, if efficiency seems like a minor issue to you, you might just find it easier to read and debug programs containing SELECT groups.
Example 13.9. The following program uses a SELECT group to create a new variable called Course whose value depends on the value of Yards:
In this example, we did not include a variable (or expression) in parentheses in the SELECT statement. As SAS processes an observation, it starts then with the first WHEN statement. If the when-condition is true, SAS executes the statement that follows the when-condition. If it is false, SAS proceeds to the next WHEN statement. If its when-condition is true SAS executes the statement that follows the when-condition. And so on ... SAS proceeds in this way until it it reaches the OTHERWISE statement.
Let's have you launch and run the program, and review the output to convince yourself that SAS executes the SELECT group just as it would the comparable IF-THEN-ELSE statement.
Oh, you might also want to remind yourself why it is necessary to use the LENGTH statement to tell SAS explicity that the length of the Course variable is 7 characters. Recall that, by default, SAS defines the length of a new character variable to be the same as the length of the first value assigned to the variable in the DATA step. Here, then, in the absence of the LENGTH statement, SAS would define the length of the Course variable to be 4, because Long is the first value to be associated with Course. That then means that the values assigned to the Course variable would be Long, Medi, Shor, and Unkn because of the truncation that would ensue. To see this first hand, comment out (or delete) the LENGTH statement, re-run the program, and review the resulting output. Point made?! Let's take a look at one more example!
Example 13.10. The following SAS program is identical to the previous program, except the order of the conditions in the SELECT group has been reversed and the LENGTH statement has been removed:
Launch and run the program, and review the output to convince yourself that this program produces output identical to the output produced by the previous program. Then, there are two points that I'd like to make concerning this program.
First, the LENGTH statement is no longer necessary. Here, SAS would define the length of the Course variable to be 7, because Unknown is the first value to be associated with Course. Since Unknown is the longest value that we assign to Course, the default length of 7 works just fine for us. No need to override the default!
Second, consider the yardage of 6392 for the Centre Hills golf course. If you look at the SELECT group, you can see that a yardage of 6392 meets the condition of both the second and third WHEN statement. That is, 6392 is less than 6400, and it is less than 6700. How does SAS behave in this case? The answer lies in knowing that if more than one WHEN statement is true, SAS uses only the first WHEN statement. Once SAS finds a when-condition that is true, no other when-conditions are evaluated for a particular observation. So, here SAS deems that the second WHEN statement is true when Yards is 6392, and therefore assigns Short to the Course variable. SAS then steps out of the SELECT group, and being at the end of the DATA step, moves onto process the next observation in the data set.
Incidentally, the first WHEN statement in our SELECT group:when (Yards = .) Course = 'Unknown';
should remind us of another important point. Do you remember that good old programming habit about always programming for missing values? If we removed the above WHEN statement from our SELECT group, what value would SAS assign to Course if Yards were missing for a particular golf course? Remembering that a missing value (.) is considered smaller than any other numerical value, SAS would consider the missing value to be less than 6400, and Course would be errantly assigned to the value of Short. So, the good old programming habit holds for SELECT groups as well: always program for missing values.
Lesson 5: If-Then-Else Statements
Again, once you've read your data into a SAS data set, you probably want to do something with it. A common thing to do is to change the original data in some way in an attempt to answer a research question of interest to you. In the last lesson, we learned how to use assignment statements (and functions) to add some information to all of the observations in the data set. In this lesson, we will learn how to use if-then-else statements to add some information to some but not all of the observations in your data set.
Learning objectives & outcomes
Upon completing this lesson, you should be able to do the following:
- follow the good programming practice of programming for missing values
- write an if-then-else statement that involves any of the comparison operators
- write a series of mutually exclusive conditions for use in an (efficient) if-then-else statement
- use the AND and OR operators to combine conditions for use in an if-then-else statement
- write an if-then-else statement that compares character values efficiently and accurately
Our "to do" list for this lesson
In order to complete the lesson you should:
- Read the lesson pages that follows.
- Type up your answers to the homework problems in a Word file named homework05_yourPSUloginid. By now you should be used to the format. If your PSU login id is xyz123, then name your file homework05_xyz123. Upload the file to the Lesson #5 Homework Dropbox.
- Post any questions or comments you have concerning the lesson's material to the Lesson #5 General Discussion Board.
- Take the Lesson #5 Mastery Quiz.
5.1 - If-Then Statements
In this lesson, we investigate a number of examples that illustrate how to change a subset of the observations in our data set. In SAS, the most common way to select observations that meet a certain condition is to utilize an if-then statement. The basic form of the statement is:
IF (condition is true) THEN (take this action);
In the previous lesson, we looked at an example in which the condition was:
avg < 65
and the action was:
status = 'Failed'
For each observation, SAS evaluates the condition that follows the keyword IF — in this case, is the student's average less than 65? — to determine if it is true or false. If the condition is true, SAS takes the action that follows the keyword THEN — in this case, change the student's status to 'Failed.' If the conditon is false, SAS ignores the THEN clause and proceeds to the next statement in the DATA step. The condition always involves a comparison of some sort, and the action taken is typically some sort of assignment statement.
Example 5.1. There is nothing really new here. You've already seen an if-then(-else) statement in the previous lesson. Our focus there was primarily on the assignment statement. Here, we'll focus on the entire if-then statement, including the condition. The following SAS program creates a character variable status, whose value depends on whether or not the student's first exam grade is less than 65:
First, note that we continue to work with the grades data set from the last lesson. Again, the data set contains student names (name), each of their four exam grades (e1, e2, e3, e4), their project grade (p1), and their final exam grade (f1). Then, launch and run the SAS program. Review the output from the print procedure to convince yourself that the values of the character variable status have been assigned correctly.
5.2 - Comparison Operators
In the previous example, we used the less-than sign to make the comparison. We can use any of the standard comparison operators to make our comparisons as long as we follow the syntax that SAS expects, which is:
|Comparison||SAS syntax||Alternative SAS syntax|
|less than or equal to||<=||LE|
|greater than or equal to||>=||GE|
|not equal to||^=||NE|
|equal to one of a list||in||IN|
It doesn't really matter which of the two syntax choices you use. It's just a matter of preference. To convince yourself that you understand how to use the alternative SAS syntax though, replace the less-than sign (<) in the Example 5.1 program with the letters "LT" (or "lt"). Then, re-run the SAS program and review the output from the PRINT procedure to see that the program indeed performs as expected.
Example 5.2. The following SAS program uses the IN operator to identify those students who scored a 98, 99, or 100 on their project score. That is, students whose p1 value equals either 98, 99, or 100 are assigned the value 'Excellent' for the project variable:
Launch and run the SAS program and review the output from the PRINT procedure to convince yourself that the program performs as described.
A side comment. After being introduced to the comparison operators, students are often tempted to use the syntax EQ in an assignment statement. If you try it, you'll soon learn that SAS will hiccup at you. Assignment statements must always use the equal sign (=).
5.3 - Alternative Actions
As the output from Example 5.1 illustrates, there may be occasions when you want to use an if-then-else statement instead of just an if-then statement. In that example, we told SAS only what to do if the condition following the IF keyword was true. By including an else statement, we can tell SAS what to do if the condition following the IF keyword is false.
Example 5.3. The following SAS program creates a character variable status, whose value is "Failed" IF the student's first exam grade is less than 65, otherwise (i.e., ELSE) the value is "Passed":
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the values of the character variable status have been assigned correctly.
Note that, in general, using ELSE statements with IF-THEN statements can save resources:
- Using IF-THEN statements without the ELSE statement causes SAS to evaluate all IF-THEN statements.
- Using IF-THEN statements with the ELSE statement causes SAS to execute IF-THEN statements until it encounters the first true statement. Subsequent IF-THEN statements are not evaluated.
For greater efficiency, you should construct your IF-THEN-ELSE statements with conditions of decreasing probabilities.
5.4 - Programming For Missing Values
Example 5.4. This if-then-else stuff seems easy enough! Let's try creating another status variable for our grades data set, but this time let's allow its value to depend on the value of the student's fourth exam (e4) rather than the value of the student's first exam (e1):
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the values of the character variable status have been assigned correctly. Oooops! What happened?! SAS assigned a "Failed" status to John Simon, seemingly because his exam score is missing. That's certainly one way of assigning grades, but it's probably not going to make John very happy.
The important point to remember is that SAS considers a missing value to be smaller than any other numerical value. That is, a missing value (.) is considered smaller than 12, smaller than 183, and even smaller than 0. Thus, we should stick to another good programming habit: always program for missing values. Say it to yourself over and over and over again ... always program for missing values ... until you remember it. It may save you alot of trouble down the road.
Example 5.5. Now, let's look at our SAS program again, but this time having written the program so that SAS is told to assign status a missing value (a blank space ' ' since it is a character variable) if e4 is missing (a period . since it is a numeric variable):
Launch and run the SAS program. Review the output from the PRINTprocedure to convince yourself that this time the values of the character variable status really have been assigned correctly. Note that this program also illustrates the use of more than one ELSE statement. You can use as many ELSE statements as necessary, as long as they are attached to a preceding IF-THEN statement.
5.5 - Logical Operators
In addition to the comparison operators that we learned previously, we can also use the following logical operators:
|Operation||SAS syntax||Alternative SAS syntax|
|are both conditions true?||&||AND|
|is either condition true?|||||OR|
|reverse the logic of a comparison||^ or ~||NOT|
You will want to use the AND operator to execute the THEN statement if both expressions that are linked by AND are true, such as here:IF (p1 GT 90) AND (f1 GT 90) THEN performance = 'excellent';
You will want to use the OR operator to execute the THEN statement if either expression that is linked by the OR is true, such as here:IF (p1 GT 90) OR (f1 GT 90) THEN performance = 'very good';
And, you will want to use the NOT operator in conjunction with other operators to reverse the logic of a comparison:IF p1 NOT IN (98, 99, 100) THEN performance = 'not excellent';
Now when we look at examples using these logical operators, why stop at just two ELSE statements? Let's go crazy and program a bunch of them! One thing though — when we do, we have to be extra careful to make sure that our conditions are mutually exclusive. That is, we have to make sure that, for each observation in the data set, one and only one of the conditions holds. This most often means that we have to make sure that the endpoints of our intervals don't overlap in some way.
Example 5.6. The following SAS program illustrates the use of several mutually exclusive conditions within an if-then-else statement. The program uses the AND operator to define the conditions. Again, when comparisons are connected by AND, all of the comparisons must be true in order for the condition to be true.
First, inspect the program to make sure you understand the code. Then, launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the letter grades have been assigned correctly. Also note how the program in general, and the if-then-else statement in particular, is formatted in order to make the program easy to read. The conditions and assignment statements are aligned nicely in columns and parentheses are used to help offset the conditions. Whenever possible ... okay, make that always ... format (and comment) your programs. After all, you may actually need to use them again in a few years. Trust me ... you'll appreciate it then!
Oh, one more point. You may have noticed, after the condition that takes care of missing values, that the conditions appear in order from A, B, ... down to F. Is the instructor treating the glass as being half-full as opposed to half-empty? Hmmm ... actually, the order has to do with the efficiency of the statements. When SAS encounters the condition that is true for a particular observation, it jumps out of the if-then-else statement to the next statement in the DATA step. SAS thereby avoids having to needlessly evaluate all of the remaining conditions. Hence, we have ourselves another good programming habit ... arrange the order of your conditions (roughly speaking, of course!) in an if-then-else statement so that the most common one appears first, the next most common one appears second, and so on. You'll also need to make sure that your condition concerning missing values appears first in the IF statement, otherwise SAS may bypass it.
Example 5.7. In the previous program, the conditions were written using the AND operator. Alternatively, we could have just used straightforward numerical intervals. The following SAS program illustrates the use of alternative intervals as well as the alternative syntax for the comparison operators:
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the letter grades have again been assigned correctly.
Example 5.8. Now, suppose an instructor wants to give bonus points to students who show some sign of improvement from the beginning of the course to the end of the course. Suppose she wants to add two points to a student's overall average if either her first exam grade is less than her third and fourth exam grade or her second exam grade is less than her third and fourth exam grade. (Don't ask why! I'm just trying to motivate something here.) The operative words here are "either" and "or". In order to accommodate the instructor's wishes, we need to take advantage of the OR comparison operator. When comparisons are connected by OR, only one of the comparisons needs to be true in order for the condition to be true. The following SAS program illustrates the use of the OR operator, the AND operator, and the use of the OR and AND operators together:
First, inspect the program to make sure you understand the code. In particular, note that logical comparisons that are enclosed in parentheses are evaluated as true or false before they are compared to other expressions. In this example:
- SAS first determines if e1 is less than e3 AND if e1 is less than e4
- SAS then determines if e2 is less than e3 AND if e2 is less than e4
- SAS then determines if the first bullet is true OR if the second bullet is true
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that, where appropriate, two points were added to the student's average (avg) to get an adjusted average (adjavg). Also, note that we didn't have to worry about programming for missing values here, because the student's adjusted average (adjavg) would automatically be assigned missing if his or her average (avg) was missing. SAS calls this "propagation of missing values."
5.6 - Comparing Character Values
All of the if-then-else statement examples we've encountered so far involved only numeric variables. Our comparisons could just as easily involve character variables. The key point to remember when comparing character values is that SAS distinguishes between uppercase and lowercase letters. That is, character values must be specified in the same case in which they appear in the data set. We say that SAS is "case-sensitive." Character values must also be enclosed in quotation marks.
Example 5.9. Suppose our now infamous instructor wants to identify those students who either did not complete the course or failed. Because SAS is case-sensitive, any if-then-else statements written to identify the students have to check for those students whose status is 'failed' or 'Failed' or 'FAILED' or ... you get the idea. One rather tedious solution would be to check for all possible "typings" of the word "failed" and "incomp" (for incomplete). Alternatively, we could use the UPCASE function to first produce an uppercase value, and then make our comparisons only between uppercase values. The following SAS program takes such an approach:
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the if-then-else statement that involves the creation of the variable action is inadequate while the one that uses the UPCASE function to create the variable action2 works like a charm.
By the way, when making comparisons that involve character values, you should know that SAS considers a missing character value (a blank space ' ') to be smaller than any letter, and so the good habit of programming for missing values holds when dealing with character variables as well.
5.7 - Performing Multiple Actions
All of the examples we've looked at so far have involved performing only one action for a given condition. There may be situations in which you want to perform more than one action.
Example 5.10. Suppose our instructor wants to assign a grade of zero to any student who missed the fourth exam, as well as notify the student that she has done so. The following SAS program illustrates the use of the DO-END clause to accommodate the instructors wishes:
The DO statement tells SAS to treat all of the statements it encounters as one all-inclusive action until a matching END appears. If no matching END appears, SAS will hiccup. Launch and run the SAS program, and review the output of the PRINT procedure to convince yourself that the program accomplishes what we claim.
5.8 - Summary
In this lesson, we learned how to write if-then-else statements in order to change the contents of our SAS data set. The homework for this lesson will give you more practice with this technique so that you become even more familiar with how it works and can use them in your own SAS programming.