Thursday, June 19, 2014

Base64 Encodings

I was thinking more about what to post on this blog, and decided it couldn't hurt to post something on different encoding or encryption schemes that I find.  Given that, I was going back through and analyzing the different malware samples that I had referenced in the last post, and found the "NEWSREELS" sample.  This sample was very similar to the sample from the previous post, so I figured it would be a good place to continue.

Looking at this sample, I immediately go to strings to see what low-hanging indicators I may be able to find, when I come across these strings.

Our Friend From Last Post
So, we have an alphabet (albeit a different one from last post), and the same key.  Given this, I am betting that there is going to be a random character string in there that begs decoding.  A few lines down, and this sample doesn't disappoint.

Encoded or Encrypted String
So, random character string for ciphertext, a key, and an alphabet.  Since we have been here before, we can look for the key and the alphabet in the sample the same way that we did last time.  One function call off the main function, we find them.

Alphabet and Key in Code
However, the logic looks slightly different than the last sample.  It would be too easy if both samples used the same logic, I suppose.  So stepping through the logic we see that this sample uses a circular alphabet and key (i.e., for array indices larger than length(alphabet) or length(key), it will wrap to the beginning of the array and keep counting) rather than static array locations.  So we make a few little changes to our python script (not trying to win any optimization awards here), and come up with this to decode the string.

Hastily Written Decoder
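
For anyone following along without the screenshot, a minimal sketch of the wrap-around idea looks something like the following.  The alphabet and key here are placeholders, not the values from the sample, and the subtraction step is assumed to match the logic from the previous post.  The encode() helper is only there so the sketch can be round-trip tested.

```python
# Placeholder values for illustration only; the real alphabet and key
# come from the sample and are not reproduced here.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789:/."
KEY = "key"


def decode_wrapping(ciphertext, alphabet=ALPHABET, key=KEY):
    """Shift each character back by the alphabet position of its key character."""
    out = []
    for i, ch in enumerate(ciphertext):
        key_ch = key[i % len(key)]                          # key wraps around
        shift = alphabet.index(key_ch)
        pos = (alphabet.index(ch) - shift) % len(alphabet)  # alphabet wraps around
        out.append(alphabet[pos])
    return "".join(out)


def encode_wrapping(plaintext, alphabet=ALPHABET, key=KEY):
    """Inverse of decode_wrapping(), used only to build test input."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = alphabet.index(key[i % len(key)])
        out.append(alphabet[(alphabet.index(ch) + shift) % len(alphabet)])
    return "".join(out)


if __name__ == "__main__":
    secret = encode_wrapping("http://example.com")
    print(secret, "->", decode_wrapping(secret))
```

The modulo on both the key index and the alphabet index is what makes the arrays "circular" here, so an index that runs past either end simply wraps back around.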

Now to run it and find our indicator like we did last time...

Not Exactly What We Were Hoping For
So, you may think that something has gone wrong at this point.  I assure you it hasn't.  All that has happened here is that the string decrypts into a base64 encoded string.  How do we know this?  After a while you get to be able to recognize base64 pretty easily (it is used lots).  Or you could just plug it into a base64 decoder site, but that isn't practical for every string like this we come across until we are able to readily recognize them.  Instead, let's do something even less practical and see where in the code this string is translated into something that the program can use.  Also, what if we didn't have this knowledge going in?  What if, during program debugging, we see this string, but aren't sure where to put breakpoints to determine where it will be decoded?  That is kinda the point of this post. 
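
Until the eye gets trained, a rough heuristic can help flag candidate strings.  The sketch below is just that, a heuristic: it assumes the standard base64 alphabet with "=" padding, and the function name is my own invention.  Plenty of ordinary four-letter words will pass it too, so treat a hit as a hint rather than proof.

```python
import base64
import binascii
import re

# Standard base64 characters followed by at most two padding characters.
B64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(s):
    """Return True if s has the shape of a standard base64 string."""
    if not s or len(s) % 4 != 0:
        return False
    if not B64_RE.fullmatch(s):
        return False
    try:
        base64.b64decode(s, validate=True)
    except binascii.Error:
        return False
    return True


if __name__ == "__main__":
    sample = base64.b64encode(b"http://example.com/a.jpg").decode("ascii")
    print(sample, looks_like_base64(sample))
    print("not base64!", looks_like_base64("not base64!"))
```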

So, let's think about what base64 does.  All it does is take 3 bytes of data (24 bits) and segment them into four 6-bit blocks.  It then translates the values into characters based on a 64-character alphabet (largest alphabet that can be made from 6 bits), and outputs those characters.  Since base64 works in 24-bit blocks, there is a chance that the last block will translate to either one, two, or three 8-bit values.  Therefore, base64 uses padding characters (the "=" character) to designate how many 8-bit values are present in the last 24 bits.
  • Two padding characters ("==") at the end designate that only one 8-bit value is contained in the last 24 bits
  • One padding character ("=") at the end designates that two 8-bit values are contained in the last 24 bits
  • No padding characters at the end designate that the block is full, and three 8-bit values are contained in the last 24 bits
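
The padding rules above are easy to check with Python's standard base64 module.  The six_bit_groups() helper is a name I made up for illustration; it just shows the 24-bit to four 6-bit split by hand.

```python
import base64


def six_bit_groups(three_bytes):
    """Split exactly three bytes (24 bits) into four 6-bit values."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly three bytes")
    n = int.from_bytes(three_bytes, "big")
    return [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]


if __name__ == "__main__":
    for raw in (b"a", b"ab", b"abc"):
        encoded = base64.b64encode(raw).decode("ascii")
        print(len(raw), "byte(s) ->", encoded)
    # 1 byte(s) -> YQ==
    # 2 byte(s) -> YWI=
    # 3 byte(s) -> YWJj

    print(six_bit_groups(b"abc"))  # [24, 22, 9, 35], i.e. 'Y', 'W', 'J', 'j'
```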
Given that, if the binary is planning on decoding this base64 string, it would need to check for those characters to determine how many values are encoded in the last 24-bit block.  So, we can look for that easily enough.  There are a number of different ways this can be expressed in assembly, but since the input is in ASCII characters, we can look for the ASCII or hex equivalent values being used in program logic.  Following another function call or two, and we find what we are looking for.

Padding Character Check
And here we are.  IDA was nice enough to point out that the SubStr offset returned "==" in memory.  A quick lookup of the strstr() function shows that its purpose is to find the first occurrence of a given substring within a given string, with both strings given as arguments.  With the "==" string being pushed to the stack, as well as our input string, just before the strstr() call, we can determine that these are being passed as parameters to this function.  In short, it is trying to find if and where the string "==" is located in the input string to determine the end of the string, and whether it contains two padding characters.  This in and of itself is not purely indicative of base64 decoding, but what follows drives it home.

Looking at the "jz" (jump if zero) instruction, we see that it jumps if the "test eax, eax" instruction sets the zero flag.  This is equivalent to "if eax = 0, jmp loc_401916".  EAX will return zero if the "==" string is not found in the input string.  So, if the "==" string is not found, and this is base64 decoding logic, we would expect the next thing for the binary to do is determine if a single padding character terminates the input string.  Going down to loc_401916, we are not disappointed, as we see a "push 3Dh" instruction.  Since 0x3D is the hex value for the ASCII character "=", we can determine that this is the single padding character check that we are looking for.  We decide this due to the fact that this character is being pushed as an argument, along with the input string, to the strchr() function.
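
In higher-level terms, the two checks boil down to something like the sketch below.  This is my reading of the control flow rather than a decompilation, and the function name is made up; what the binary does with the result afterwards is not modeled here.

```python
def count_padding(encoded):
    """Mimic the two checks: look for "==" first, then for a single "="."""
    if encoded.find("==") != -1:  # strstr(input, "==") came back non-NULL
        return 2
    if encoded.find("=") != -1:   # strchr(input, 0x3D) came back non-NULL
        return 1
    return 0


if __name__ == "__main__":
    for s in ("YQ==", "YWI=", "YWJj"):
        print(s, "->", count_padding(s), "padding character(s)")
```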

Now, we can determine that this is possibly the start of the base64 decoding logic, but we would also like to find the end.  This way we can break on the end of the logic, and check memory for the decoded value.  Yes, at this point we could just put the string in a base64 decoder, but that is just too easy.  So, let's go down through the code some more and see if we can determine where the base64 decoding logic is.

base64 Decoding Logic
Remembering from earlier, base64 takes three 8-bit values, and stores them in four 6-bit values.  Knowing that, it would stand to reason that the decoding of the input string would require four operations on the input string itself to determine the output.  Also, given that this input string is larger than four characters, it stands to reason that the decoding logic will contain some form of loop logic to iterate through the entire string.  At loc_40197D, we find four strchr() function calls, with translation logic for the 6-to-8 bit translation, contained in a loop.  By setting a breakpoint at the end of this loop (loc_401A20) we can see the decoded string.
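
To make the shape of that loop concrete, here is a minimal hand-rolled decoder with the same four-lookups-per-iteration structure.  It assumes the standard base64 alphabet (the sample may well use its own), and it is a sketch of the general technique, not a transcription of the disassembly.

```python
import base64

B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_decode(encoded, alphabet=B64_ALPHABET):
    """Decode base64 four characters at a time, using index lookups."""
    if len(encoded) % 4 != 0:
        raise ValueError("input length must be a multiple of four")
    stripped = encoded.rstrip("=")
    pad = len(encoded) - len(stripped)
    if pad > 2:
        raise ValueError("too many padding characters")
    # Stand in for the padding with the zero-valued character.
    body = stripped + alphabet[0] * pad

    out = bytearray()
    for i in range(0, len(body), 4):
        # Four lookups per pass, much like the four strchr() calls.
        a, b, c, d = (alphabet.index(ch) for ch in body[i:i + 4])
        out.append((a << 2) | (b >> 4))
        out.append(((b & 0x0F) << 4) | (c >> 2))
        out.append(((c & 0x03) << 6) | d)
    # Drop the bytes that only existed because of the padding.
    return bytes(out[:len(out) - pad])


if __name__ == "__main__":
    sample = base64.b64encode(b"http://example.com/a.jpg").decode("ascii")
    print(b64_decode(sample))
```

Each pass turns four 6-bit lookups back into three bytes, which is why a breakpoint just past the loop is a good place to look for the finished string.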

Decoded String
So, the base64 encoded string translates to "http://61.219.67.1/Rossini.jpg".  Detonating the malware and listening with Wireshark confirms this callout.

Callout to Address in Decoded String
It is a fair bet that this JPEG file is not a JPEG file, but either a configuration file or second-stage executable being downloaded.  Knowing this, we could get on a non-attributable network and download it to find out.  Even taking all necessary precautions, obtaining this file from the same network space that the malware was found in could tip your hand, letting the adversaries know that you are onto them.  At which point they will drop this malware and use another, making the indicators you just found less effective.

However, we did just find a bit of malware that uses the same key as the first piece of malware we looked at.  Therefore, we could adjust the yara signature that we created in the last post to include both the indicators from this sample, as well as the relationship between the two binaries.  Yara lets us do this by allowing us to reference other rules in our "condition" statement.

Our Related Yara Rules
Since the key is the common indicator between the two binaries, we put that string in its own yara rule with the name of the malware family or adversary attribution (in this case we named our malware family "Malware_Family").  Then we put the remaining indicators in their own separate rules for each malware sample, and state in the "condition" statement that we require the strings in the "Malware_Family" rule to be found in order to trigger the separate malware sample rules.

You will also see the "ascii" and "wide" modifiers at the tail end of our strings.  These designate whether the characters are represented by either one-byte ("ascii") or two-byte ("wide") values.  The two-byte representation is commonly used by Unicode string representations.  However, it is important to note that according to the Yara User's Manual, yara does not fully support true UTF-16 encoding for non-English characters.  I also apologize as I noticed that I did not include the yara rule with these modifiers on the previous post, an issue I will be remedying shortly.

So, we have our rules, now to test them out.
Our Rules Hit on Both
We can see in the red boxes that yara returns hits for both the malware sample rules and the "Malware_Family" rule.  We can use this method to create relationships as we see more samples, add some intelligence to our yara rules, and make them more effective in recognizing newer related samples.



Thursday, June 12, 2014

Why Those Strings?

So I was looking back at some of the older malware on my backup drive and came across a bunch of malware I got from contagio a while back.  Like anything you find that you haven't looked at in a while, I stopped to see what I had, and noticed that I had started looking at some of them, but there weren't any notes.  So I decided to grab one of the smaller ones and just have a look.

I decided on the "SWORD" malware sample from the APT1 section.  This was described as a single-sourced remote command utility, so I figured it would be a good little one to start the blog off with.  Initially I just ran strings across it to see if there was anything there of interest.  As strings is basically the lowest hanging fruit in malware analysis, I didn't expect to find anything that I would have been able to take to responders and say "You'll need to look for this."  If only.  What I was really looking for were any strings that weren't explicitly helpful (i.e., appear random or gibberish) but could potentially be used by the binary to do something interesting.  Here are the strings of interest that I found in the binary.

Random Strings and Misspellings
I always like to pick out the strings with misspellings, misused words, or l33t speak; they can come in handy.  The other strings could be useful in creating a yara signature for this, if we can determine what their use is in the binary.  So, I fire up IDA to see what is actually going on with these strings.  The first call in the program that isn't to a Windows API points us to the following section:

Random Strings Seen Used
Okay, so at least two of them are being used in the code.  Looking at the pattern of the first string that we find in the code, we can see that it is the keys of the keyboard in the following order:

  1. Start at the numeric row, and press every key from right to left
  2. Retype the numeric row, this time holding the Shift key
  3. Repeat until the last row is typed, skipping the control keys
  4. Remove the &"\{}
We will call this the "keyboard" string.  The second is "the quick brown fox jumps over the lazy dog" with the vowels (except y) removed after the word 'brown'.  We will call this the "fox" string.  This is not necessarily important, but the eyes are prone to determining patterns.  So we know that these are being used in the program, and that there is some logic surrounding them.  Checking through the rest of the program doesn't show the last string being used.  However, running it in a debugger shows that string being passed to the same function as the other two.

There is Another String!
Right.  We'll call this one String 3.  So, we have three strings in this section, and some logic around them.  What is that logic doing?  Well, initially it is just reading the values into memory locations, but the interesting part of what these are doing comes around this code segment.

The Interesting Bit
We see three separate strchr calls, and they all sit within a loop.  This tells us two things.  One, the program was written in something that can use native C strings (such as C or C++).  Two, this is likely pulling something out of each string in a formulaic method to create the actual data that it needs.  In debugging, we see that ecx being pushed the first two times is 0x75, which is the 'u' character, which is the starting character of String 3.

So, the first strchr call determines where in the "keyboard" string the first character of String 3 is located.  Then, if the location is equal to zero, the program throws an error.  Otherwise, it will check again (the second strchr call) and store the result in esi.  This segment then goes on to grab the first character in the "fox" string, and determine where it is located in the "keyboard" string (the third strchr call).  This leaves us with the following values stored in memory:

  • Location of the first character of String 3 in the "keyboard" string
  • The first character of the "fox" string
  • Location of the first character of the "fox" string in the "keyboard" string
Just after the "jnz short loc_4010D4" instruction we see the logic that subtracts the strchr(keyboard, fox[0]) value from the strchr(keyboard, String3[0]) value.  Then the program grabs the value in the "keyboard" string at the position resulting from the previous subtraction.  So the ending location that the program is looking for can be written like this:

value = keyboard[strchr(keyboard, String3[0]) - strchr(keyboard, fox[0])]
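
As a rough Python rendering of that expression (not the script from the screenshot), the sketch below walks the encoded string and the key together.  Pairing String3[i] with fox[i] for every position is my assumption from the loop; only index 0 is shown above.  The strings in the example are placeholders, not the ones from the sample.

```python
def decode_string3(cipher, key, keyboard):
    """Apply keyboard[index(cipher[i]) - index(key[i])] at each position."""
    if len(key) < len(cipher):
        raise ValueError("key is shorter than the encoded string")
    out = []
    for c, k in zip(cipher, key):
        # str.index raises ValueError when the character is missing,
        # which stands in for the binary's error path.
        pos = keyboard.index(c) - keyboard.index(k)
        if pos < 0:
            # Python would silently index from the end; refuse instead.
            raise ValueError("negative offset into the keyboard string")
        out.append(keyboard[pos])
    return "".join(out)


if __name__ == "__main__":
    # Placeholder alphabet, key, and ciphertext for illustration only.
    keyboard = "abcdefghijklmnopqrstuvwxyz"
    print(decode_string3("ico", "bcd", keyboard))  # hal
```

Note that strchr returns a pointer rather than an index, so the subtraction in the binary is really a difference between two positions in the same string; str.index gives the same result more directly.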

Knowing that, I decided to knock up a quick version of the logic (without the error checking) just to see what the value would be before going through debugging it.  I decided to do it in python, because I hadn't used it in a while and why not?  This is what I came up with:


So, we run this real quick to see what we come up with and...


Yep.  There is the IP address and port number used by this binary (remember it only talks to one location).  Checking in debugging and by detonating this binary confirms what we are seeing from running our script.


There are the callouts to the same IP address and port that we determined from our script!  Now, we could have gotten this IP and port by dynamic analysis, and we could have determined that two of the strings were used in the binary.  Some analysts would continue on either due to time constraints, or satisfied that indicators were found.  However, by actually looking into why and how these strings are used, we can be confident in using those strings in yara rules that could help us distinguish the presence of this malware in the future.  Ideally, we would want to determine the commonality of these strings between multiple samples, but I only had one sample of this malware.  However, we could create something like the following as a starting rule:

Yara Rule for Given Indicators

So, when you get asked by someone "Why did you use these strings for your signature?", you will have an answer. =)


[EDIT: 06.19.14 - Fixed yara rule to include the "wide" and "ascii" modifiers]
