Thursday, June 19, 2014

Base64 Encodings

I was thinking more about what to post on this blog, and decided it couldn't hurt to post something on different encoding or encryption schemes that I find.  Given that, I was going back through and analyzing the different malware samples that I had referenced in the last post, and found the "NEWSREELS" sample.  This sample was very similar to the sample from the previous post, so I figured it would be a good place to continue.

Looking at this sample, I immediately run strings to see what low-hanging indicators I may be able to find, and come across these.

Our Friend From Last Post
So, we have an alphabet (albeit a different one from last post), and the same key.  Given this, I am betting that there is going to be a random character string in there that begs decoding.  A few lines down, this sample doesn't disappoint.

Encoded or Encrypted String
So, random character string for ciphertext, a key, and an alphabet.  Since we have been here before, we can look for the key and the alphabet in the sample the same way that we did last time.  One function call off the main function we find it.

Alphabet and Key in Code
However, the logic looks slightly different than the last sample.  It would be too easy if both samples used the same logic, I suppose.  Stepping through the logic, we see that this sample uses a circular alphabet and key (i.e., for array indices larger than length(alphabet) or length(key), it will wrap to the beginning of the array and keep counting) rather than static array locations.  So we make a few little changes to our Python script (not trying to win any optimization awards here), and come up with this to decode the string.

Hastily Written Decoder
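For anyone following along without the screenshot, here is a minimal sketch of just the wraparound indexing described above.  It is not the decoder from this post: the alphabet and key below are made-up placeholder values, and the helper name circular_get is my own.  It only shows how an index past the end of an array wraps back to the beginning.

```python
def circular_get(seq, index):
    """Return seq[index], wrapping past the end back to the start."""
    return seq[index % len(seq)]


# Hypothetical values for illustration only, NOT taken from the sample.
alphabet = "abcdefghijklmnopqrstuvwxyz"
key = "key"

# Index 27 is past the end of the 26-character alphabet, so it wraps to index 1.
wrapped_char = circular_get(alphabet, 27)

# Walking a 3-character key for 7 positions simply repeats the key.
key_stream = "".join(circular_get(key, i) for i in range(7))

print(wrapped_char, key_stream)
```

The modulo operator does all of the work here, which is why the change from static array locations to a circular alphabet and key only takes a few small edits to a script.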

Now to run it and find our indicator like we did last time...

Not Exactly What We Were Hoping For
So, you may think that something has gone wrong at this point.  I assure you it hasn't.  All that has happened here is that the string decrypts into a base64 encoded string.  How do we know this?  After a while you get to be able to recognize base64 pretty easily (it is used lots).  Or you could just plug it into a base64 decoder site, but that isn't practical for every string like this we come across until we are able to readily recognize them.  Instead, let's do something even less practical and see where in the code this string is translated into something that the program can use.  Also, what if we didn't have this knowledge going in?  What if, during program debugging, we see this string, but aren't sure where to put breakpoints to determine where it will be decoded?  That is kinda the point of this post. 
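Until recognizing base64 by eye becomes second nature, a rough heuristic can help triage strings.  The sketch below assumes the standard base64 alphabet (A-Z, a-z, 0-9, "+" and "/") with "=" padding; the function name looks_like_base64 is my own, and a sample using a custom alphabet would not pass this check.

```python
import base64
import binascii
import re

# Standard alphabet characters followed by at most two "=" padding characters.
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(candidate: str) -> bool:
    """Heuristic check: could this string be standard, padded base64?"""
    # Padded base64 always comes in groups of four characters.
    if len(candidate) == 0 or len(candidate) % 4 != 0:
        return False
    if not _BASE64_RE.fullmatch(candidate):
        return False
    try:
        base64.b64decode(candidate, validate=True)
    except binascii.Error:
        return False
    return True


print(looks_like_base64("aGVsbG8gd29ybGQ="))
print(looks_like_base64("not base64!"))
```

Keep in mind that this will produce false positives: any four-letter word such as "test" is technically valid base64.  It is a filter for deciding what to look at more closely, not proof of encoding.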

So, let's think about what base64 does.  All it does is take 3 bytes of data (24 bits) and segment them into four 6-bit blocks.  It then translates each 6-bit value into a character from a 64-character alphabet (the largest alphabet that can be indexed with 6 bits), and outputs those characters.  Since base64 works in 24-bit blocks, the last block may hold only one, two, or three 8-bit values.  Therefore, base64 uses padding characters (the "=" character) to designate how many 8-bit values are present in the last 24 bits.
  • Two padding characters ("==") at the end designate that only one 8-bit value is contained in the last 24 bits
  • One padding character at the end designates that two 8-bit values are contained in the last 24 bits
  • No padding characters at the end designate that the block is full, and three 8-bit values are contained in the last 24 bits
Given that, if the binary is planning on decoding this base64 string, it would need to check for those characters to determine how many values are encoded in the last 24-bit block.  So, we can look for that easily enough.  There are a number of different ways this can be expressed in assembly, but since the input is in ASCII characters, we can look for the ASCII or hex equivalent values being used in program logic.  Following another function call or two, and we find what we are looking for.
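The padding rules above are easy to confirm with Python's standard base64 module.  This is a small sketch using the standard alphabet; the helper bytes_in_last_block is a name I made up for the illustration.

```python
import base64


def bytes_in_last_block(encoded: str) -> int:
    """Return how many 8-bit values the final 24-bit block holds."""
    padding = len(encoded) - len(encoded.rstrip("="))
    return 3 - padding


# One, two, and three input bytes give two, one, and zero padding characters.
for data in (b"A", b"AB", b"ABC"):
    encoded = base64.b64encode(data).decode("ascii")
    print(data, encoded, bytes_in_last_block(encoded))
```

The three inputs encode to "QQ==", "QUI=", and "QUJD", matching the three cases in the list above.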

Padding Character Check
And here we are.  IDA was nice enough to point out that the SubStr offset returned "==" in memory.  A quick lookup of the strstr() function shows that its purpose is to find the first occurrence of a given substring within a given string, with both strings given as arguments.  With the "==" string being pushed to the stack, as well as our input string, just before the strstr() call, we can determine that these are being passed as parameters to this function.  In short, it is trying to find if and where the string "==" is located in the input string, to determine where the encoded data ends and whether it contains two padding characters.  This in and of itself is not purely indicative of base64 decoding, but what follows drives it home.

Looking at the "jz" (jump if zero) instruction, we see that it jumps if the "test eax, eax" instruction sets the zero flag.  This is equivalent to "if eax == 0, jmp loc_401916".  strstr() returns NULL when the substring is not found, so EAX will be zero if the "==" string is not in the input string.  So, if the "==" string is not found, and this is base64 decoding logic, we would expect the next thing for the binary to do is determine if a single padding character terminates the input string.  Going down to loc_401916, we are not disappointed, as we see a "push 3Dh" instruction.  Since 0x3D is the hex value for the ASCII character "=", we can determine that this is the single padding character check that we are looking for.  We decide this due to the fact that this character is being pushed as an argument, along with the input string, to the strchr() function.
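As a rough translation of what the disassembly appears to be doing, the two checks can be sketched in Python.  This is my reading of the logic rather than decompiled code: str.find() stands in for strstr() and strchr(), returning -1 where the C functions would return NULL, and the function name count_padding is hypothetical.

```python
def count_padding(encoded: str) -> int:
    """Mimic the two-step padding check seen in the disassembly."""
    # Stand-in for strstr(input, "=="): look for two padding characters first.
    if encoded.find("==") != -1:
        return 2
    # Stand-in for strchr(input, 0x3D): 0x3D is the ASCII code for "=".
    if encoded.find(chr(0x3D)) != -1:
        return 1
    # Neither check matched, so the last block is full.
    return 0


for sample in ("QQ==", "QUI=", "QUJD"):
    print(sample, count_padding(sample))
```

The order matters: checking for "==" first is what lets the single "=" check mean exactly one padding character.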

Now, we can determine that this is possibly the start of the base64 decoding logic, but we would also like to find the end.  This way we can break on the end of the logic, and check memory for the decoded value.  Yes, at this point we could just put the string in a base64 decoder, but that is just too easy.  So, let's go down through the code some more and see if we can determine where the base64 decoding logic is.

base64 Decoding Logic
Remembering from earlier, base64 takes three 8-bit values, and stores them in four 6-bit values.  Knowing that, it would stand to reason that decoding would require four operations on the input string itself to produce each block of output.  Also, given that this input string is larger than four characters, it stands to reason that the decoding logic will contain some form of loop to iterate through the entire string.  At loc_40197D, we find four strchr() function calls, with translation logic for the 6-to-8 bit translation, contained in a loop.  By setting a breakpoint at the end of this loop (loc_401A20) we can see the decoded string.
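To make the shape of that loop concrete, here is a minimal Python sketch of a base64 decoder built the same way: four alphabet lookups per iteration, followed by the 6-to-8 bit translation.  It assumes the standard alphabet and is not a transcription of the sample's code; str.index() plays the role that strchr() plus pointer arithmetic would play in C.

```python
import base64

STANDARD_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def decode_base64(encoded: str, alphabet: str = STANDARD_ALPHABET) -> bytes:
    """Decode padded base64 four characters (24 bits) at a time."""
    stripped = encoded.rstrip("=")
    padding = len(encoded) - len(stripped)
    # Swap padding for the zero-valued character so every block has four symbols.
    padded = stripped + alphabet[0] * padding
    if len(padded) % 4 != 0:
        raise ValueError("input length is not a multiple of four")

    out = bytearray()
    for i in range(0, len(padded), 4):
        # Four lookups, one per 6-bit value (the four strchr() calls).
        a, b, c, d = (alphabet.index(ch) for ch in padded[i:i + 4])
        block = (a << 18) | (b << 12) | (c << 6) | d
        # Split the 24-bit block back into three 8-bit values.
        out += bytes([(block >> 16) & 0xFF, (block >> 8) & 0xFF, block & 0xFF])

    # Drop the bytes that only existed because of padding.
    return bytes(out[:len(out) - padding])


example = base64.b64encode(b"hello world").decode("ascii")
print(decode_base64(example))
```

Spotting this pattern in a disassembly (a loop, four lookups against a 64-character table, then shifts and masks) is a strong hint of base64, even when the alphabet has been swapped for a custom one.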

Decoded String
So, the base64 encoded string translates to "http://61.219.67.1/Rossini.jpg".  Detonating the malware and listening with Wireshark confirms this callout.

Callout to Address in Decoded String
It is a fair bet that this JPEG file is not a JPEG file, but either a configuration file or second-stage executable being downloaded.  Knowing this, we could get on a non-attributable network and download it to find out.  Even taking all necessary precautions, obtaining this file from the same network space that the malware was found in could tip your hand, letting the adversaries know that you are onto them.  At that point they will drop this malware and use another, making the indicators you just found less effective.

However, we did just find a bit of malware that uses the same key as the first piece of malware we looked at.  Therefore, we could adjust the yara signature that we created in the last post to include both the indicators from this sample, as well as the relationship between the two binaries.  Yara lets us do this by allowing us to reference other rules in our "condition" statement.

Our Related Yara Rules
Since the key is the common indicator between the two binaries, we put that string in its own yara rule with the name of the malware family or adversary attribution (in this case we named our malware family "Malware_Family").  Then we put the remaining indicators in their own separate rules for each malware sample, and state in the "condition" statement that we require the strings in the "Malware_Family" rule to be found in order to trigger the separate malware sample rules.

You will also see the "ascii" and "wide" modifiers at the tail end of our strings.  These designate whether the characters are represented by one-byte ("ascii") or two-byte ("wide") values.  The two-byte representation is commonly used by Unicode string representations.  However, it is important to note that according to the Yara User's Manual, yara does not fully support true UTF-16 encoding for non-English characters.  I also apologize, as I noticed that I did not include the yara rule with these modifiers in the previous post, an issue I will be remedying shortly.

So, we have our rules, now to test them out.
Our Rules Hit on Both
We can see in the red boxes that yara returns hits for both the malware sample rules and the "Malware_Family" rule.  We can use this method to create relationships as we see more samples, add some intelligence to our yara rules, and make them more effective in recognizing newer related samples.


