Mining Data Streams
When working with data we often do not know the entire data set in advance. For example, think of the dataset of all Google queries: it is constantly changing and growing. This is what leads us to the idea of data streams. These data streams are potentially infinite and non-stationary, i.e. their distributions are constantly changing.
Data Stream Model

Let us take a closer look at what data streams look like. We have a system that is receiving input at a rapid rate from one or more input ports (each port being a separate stream). We think of this input as elements or tuples of data.
Now we want to make decisions or calculations using these data streams. But we run into a few problems: the data is constantly changing, and we cannot fit all of it in memory, be it primary or secondary. So we need to develop algorithms that either sample the data or compute approximations.
Think of Google as the system. They have multiple data streams: one could be search queries, another could come from Gmail or Google Drive, etc. If we take a look at the search query stream, Google might want to find out which search queries are popular right now to create a trending page.
Sampling a Data Stream
Since we can't store and use the entire data stream, we want to take samples of it. However, we want these samples to be fair and representative of the entire data stream, which can make things complicated. For example, if we want a sample size of 100 elements we cannot just keep the last 100 elements, as this is not a fair representation of the entire data stream.
When working with samples we are interested in two types: a fixed proportion of the data stream, for example 10% of all elements, or a random sample of fixed size, for example 100 elements.
Fixed Proportion Sample
When working with a fixed proportion sample we, for example, want 1 in 10 elements, i.e. 10% of the data. However, this has some issues: if the data stream is not very large then 10% might not be enough data for our task, be it a calculation, decision or prediction.
You can also imagine the opposite: since the data stream is potentially infinite, this fixed proportion will grow without bound, which could lead to us not being able to store it in memory.
Naive Approach
Let us carry on with the example of Google search queries. We might imagine that Google receives a stream of tuples of the form (user, query, time). If we could get a 10% sample of the data stream, we could estimate the answer to the question "How often did a user issue the same query at least twice?".
Our first naive approach might be to use a 10-sided die: every time we receive a new tuple we roll the die, and if it shows 1 we add the tuple to the sample. If we were programming this, instead of throwing a die we would generate a random number between 1 and 10 and store the tuple in the sample if the number is 1.
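A minimal sketch of this naive approach in Python (the tuple format and the function name are my own, purely for illustration):

```python
import random

def naive_sample(stream, rate=10):
    """Keep each incoming tuple independently with probability 1/rate."""
    sample = []
    for tup in stream:                    # tup is assumed to be a (user, query, time) triple
        if random.randint(1, rate) == 1:  # the "10-sided die" roll
            sample.append(tup)
    return sample
```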
However, this approach has a flaw. Imagine each user issues $x$ queries once and $d$ queries twice, so the total number of queries executed is $x + 2d$. The fraction of a user's distinct queries that are duplicates is then $\frac{d}{x+d}$. However, with this uniform 10% sample, our sample will contain about $\frac{x}{10}$ of the single queries and $\frac{2d}{10}$ of the duplicated queries at least once, but only $\frac{d}{100}$ complete pairs of duplicates, while $\frac{18d}{100}$ of the duplicated queries appear exactly once in the sample. So our naive approach arrives at the estimate $\frac{d/100}{x/10 + d/100 + 18d/100} = \frac{d}{10x + 19d}$, which is not correct.
Bucket Approach
A different approach would be to take all the queries of 10% of the users. This can be done by treating the user as the key of the tuple and hashing the key uniformly into 10 buckets, keeping only the users that land in one chosen bucket. If we then wanted a 20% sample we could simply take all the users that were hashed into two of the buckets, e.g. buckets 0 and 1.
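A small sketch of this bucket approach, using Python's `hashlib` as a stable hash function (the function names are my own):

```python
import hashlib

def user_bucket(user, num_buckets=10):
    """Hash the user id uniformly into one of num_buckets buckets."""
    digest = hashlib.md5(str(user).encode()).hexdigest()
    return int(digest, 16) % num_buckets

def keep_tuple(tup, kept_buckets=(0,), num_buckets=10):
    """Keep a tuple iff its user falls into one of the kept buckets.

    Keeping 1 of 10 buckets gives roughly a 10% sample of users,
    keeping 2 of 10 buckets roughly a 20% sample, and so on.
    """
    user, query, time = tup
    return user_bucket(user, num_buckets) in kept_buckets
```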
It is instructive to work out the resulting sampling probabilities, just as in the exercises.
Fixed Size Sample - Reservoir Sampling
Sometimes we only want a fixed-size sample of $s$ tuples, for example because we are only able to fit that amount into memory. For this to be a fair and representative sample we want each element seen so far to have an equal probability of being in the sample. So if we want a sample size of $s$ and at time $t$ we have seen 5 elements, then we want each of those 5 elements to be in the sample with probability $\frac{s}{5}$.
We can do this using the so-called Reservoir Sampling Algorithm:
- We store the first $s$ elements of the stream.
- Then suppose we have seen $n-1$ elements and the $n$-th element arrives.
- With a probability of $\frac{s}{n}$ we keep the $n$-th element. If we keep it we need to make space for it in our sample, which we do by kicking out an element picked uniformly at random.
- With a probability of $1 - \frac{s}{n}$ we discard it.
You can simply think of a water reservoir where if fresh water comes in some old water spills/overflows to make space for it.
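A minimal sketch of reservoir sampling in Python (assuming the whole sample of size $s$ fits in memory; the function name is my own):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s over a stream of unknown length."""
    sample = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            sample.append(element)                 # store the first s elements
        elif random.random() < s / n:              # keep the n-th element with probability s/n
            sample[random.randrange(s)] = element  # evict a uniformly chosen old element
    return sample
```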
We can prove that this algorithm fits our requirements by using proof by induction:
We assume that after seeing $n$ elements, the sample contains each element seen so far with probability $\frac{s}{n}$. For our induction step we need to show that after $n+1$ elements the sample still fulfills the same requirement, i.e. the sample contains each element seen so far with probability $\frac{s}{n+1}$.
Inductive step: for an element already in the sample, the probability that it stays in the sample when the $(n+1)$-th element arrives is:

$$\left(1 - \frac{s}{n+1}\right) + \frac{s}{n+1} \cdot \frac{s-1}{s} = \frac{n}{n+1}$$

The first term is the probability that the new element is discarded, the second that it is kept but evicts one of the other sample slots. So at time $n+1$ the probability that a previously seen tuple is still in the sample is

$$\frac{s}{n} \cdot \frac{n}{n+1} = \frac{s}{n+1}$$

which is exactly what we wanted to show.
Queries Over a Sliding Window
We very often find ourselves querying a data stream about the most recent input elements. This can be generalized to processing queries over a sliding window that holds the last $N$ elements.
We could solve this problem quite easily by just keeping track of the most recent $N$ elements. But what if the elements take up too much storage and therefore cannot all be kept in memory? Or we have many streams and therefore want to minimize the memory usage as much as possible.
For this, let us move on to our next example: Amazon is the system, and for every product there is a data stream consisting of 1s and 0s indicating whether that specific item was sold at each time step. We could easily come up with the query of how many times the product has been bought in the last $N$ sales.
Thanks to this stream structure we can generalize the query to: how many 1s are there in the last $k$ elements, where $k \leq N$?
As mentioned before, we don't want to just keep track of the last $N$ elements. Instead, we might think of maintaining two counters, $S$ for the number of 1s from the beginning of the stream and $Z$ for the number of 0s from the beginning of the stream. If we then assume that the distribution is uniform over time, we get the following estimate for the number of 1s in the window:

$$N \cdot \frac{S}{S + Z}$$
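For example (with made-up numbers): if we have seen $S = 300$ ones and $Z = 700$ zeros since the beginning of the stream and the window size is $N = 100$, the estimate is

$$N \cdot \frac{S}{S+Z} = 100 \cdot \frac{300}{1000} = 30 \text{ ones in the window.}$$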
This solution is very simple but not very accurate, as the distribution is not necessarily as uniform as we assumed. Just think of the sales of ice cream: it might be sold a lot in summer but much less in winter.
DGIM Method
The DGIM method is another way to approximately answer the above question of how many 1s are in the last $k$ elements of a data stream. However, it does not assume uniformity and only uses $O(\log^2 N)$ bits of storage, which is more than just 2 counters, but it makes use of the sliding window instead of looking at the entire data stream. There is however a drawback to this approach: since it is an approximation there will be an error, and in the worst case this method produces an error of 50%! You might think this is a lot, which it is; however, we can reduce this error to a small fraction in exchange for using more memory.
The main idea of the DGIM method is that we split the data stream into blocks with exponentially increasing sizes. Firstly we define that each element has a corresponding timestamp of when it was input into the stream, for example, the first element has the timestamp 1, the next 2 etc.
A block in the DGIM method consists of the following parts:
- The timestamp of its end. To be able to represent timestamps relative to the window and keep storage small, we store the timestamp modulo $N$. This uses only $O(\log N)$ bits.
- The number of 1s between its beginning and end. This uses $O(\log \log N)$ bits because we add the constraint that the number of 1s in a block must be a power of 2, so we only need to store the exponent.
We also add a few more constraints to the algorithm apart from the amount of 1s having to be a power of 2:
- There can only be one or two blocks with the same amount of 1s.
- Blocks can not overlap.
- Blocks are sorted by size, meaning that "older" blocks can not be smaller than "newer" blocks.
- Blocks disappear when they fall out of the window, i.e. when their end timestamp is more than $N$ time steps before the current time.
This leads us to have at most $O(\log N)$ blocks, and each block needs $O(\log N)$ bits, so we have a total memory usage of $O(\log^2 N)$ bits.
Updating Blocks
When new values come into the system we need to maintain our data structure. If the new element is 0 we can ignore it but if it is a 1 we need to do the following:
- Create a new block of size 1 containing just the new element, with the end timestamp being the current time.
- If there are three blocks of size 1, we need to combine the two oldest blocks into a block of size 2.
- Recursively check if there are three blocks of the same size and combine the two oldest of them into a block of twice the size.
Querying
Now we just need to know how to query the DGIM data structure to estimate the number of 1s in the sliding window. For this, we sum the sizes, i.e. the number of 1s, of all the blocks in the window apart from the last (oldest) block, for which we only add half its size. We only take half of the last block because we do not know how many of its 1s are actually still within the sliding window. This is also what leads to the worst-case error of 50%: suppose the last block has size $2^r$; we assume that half of its 1s, i.e. $2^{r-1}$, are still in the window, but possibly only a single one of them is. Since there is at least one block of each size smaller than $2^r$, the true count is at least $1 + 2 + 4 + \dots + 2^{r-1} = 2^r - 1$. So the maximum error is roughly

$$\frac{2^{r-1}}{2^r - 1} \approx 50\%$$
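A compact sketch of the DGIM structure in Python, covering both the update rules above and the query. Class and attribute names are my own, and for readability timestamps are stored as plain integers and sizes as plain counts, whereas a real implementation would store timestamps modulo $N$ and sizes as power-of-2 exponents:

```python
class DGIM:
    """Sketch of the DGIM structure for a window of the last N bits.

    Blocks are (end_timestamp, size) pairs kept newest-first.
    """

    def __init__(self, N):
        self.N = N
        self.t = 0           # current timestamp
        self.blocks = []     # newest block first: (end_timestamp, size)

    def update(self, bit):
        self.t += 1
        # The oldest block disappears once its end has left the window.
        if self.blocks and self.blocks[-1][0] <= self.t - self.N:
            self.blocks.pop()
        if bit != 1:
            return           # 0s do not change the structure
        self.blocks.insert(0, (self.t, 1))   # new block of size 1 ending now
        # Restore the invariant of at most two blocks per size by merging
        # the two oldest blocks of any size that now occurs three times.
        size = 1
        while [s for _, s in self.blocks].count(size) == 3:
            idxs = [i for i, (_, s) in enumerate(self.blocks) if s == size]
            newer, older = idxs[-2], idxs[-1]           # the two oldest of that size
            merged = (self.blocks[newer][0], 2 * size)  # keep the newer end timestamp
            del self.blocks[older]
            del self.blocks[newer]
            self.blocks.insert(newer, merged)
            size *= 2

    def query(self, k=None):
        """Estimate the number of 1s among the last k <= N elements."""
        k = self.N if k is None else k
        relevant = [(ts, s) for ts, s in self.blocks if ts > self.t - k]
        if not relevant:
            return 0
        *newer, oldest = relevant
        return sum(s for _, s in newer) + oldest[1] / 2   # only half of the oldest block
```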
Reducing Error
We can further reduce the error by changing the constraint of having either 1 or 2 blocks of the same size. Instead, we allow there to be either $r-1$ or $r$ blocks of the same size (excluding the largest size, which can have anywhere between 1 and $r$ blocks). The error is then at most $O(1/r)$, so we can trade off between the maximum error and the amount of storage used.
Example of worst case with error 50%.
Note that a block is dropped based on how far its end timestamp lies in the past, not on its absolute value; since timestamps are stored modulo $N$, the comparison is always made relative to the current time.
Filtering a Data Stream
We often find ourselves wanting to filter data. For example, if we receive a search query we want to filter the results to only those containing the search keyword. The most common example used to explain these algorithms, however, is an email spam filter: when we receive a new email we want to sort it into the good or spam pile. More generally, if we have a set $S$ of keys, we want to determine which elements of a stream are in $S$. The most obvious solution would be a hash table, but as always when working with streams, we most likely will not have enough memory to store all elements of $S$ in a hash table.
First Approaches
Our first approach could be a more streamlined hash table. We create a bit array $B$ with $n$ bits that are all initially set to 0. We then use a hash function $h$ to hash each element $s \in S$ to one of the $n$ buckets and set that bit to 1, i.e. $B[h(s)] = 1$.
To then find out if an element $x$ is in $S$, we just hash it using the same hash function and check whether $B[h(x)] = 1$. This solution produces false positives due to collisions, but no false negatives.
If $S$ has $m$ elements, the probability of a false positive is the probability that a given bit is set, which is $1 - \left(1 - \frac{1}{n}\right)^m \approx 1 - e^{-m/n}$.
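As a hypothetical worked example, if $m = 10^9$ elements are hashed into a bit array of $n = 8 \cdot 10^9$ bits, the false positive probability is roughly

$$1 - e^{-m/n} = 1 - e^{-1/8} \approx 0.117,$$

so about 12% of stream elements that are not in $S$ would still pass the filter.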
Bloom Filter
The Bloom filter is very commonly used and is heavily inspired by our first approach; however, it aims to minimize the false positive rate by using multiple hash functions.
We have the same setup as in our first approach, where we have "training data" $S$ with $m$ elements and a bit array $B$ with $n$ bits initialized to 0. However, now we have $k$ independent hash functions $h_1, \dots, h_k$. In the "training" phase we hash each element using each hash function and set the corresponding bits of the bit array to 1, i.e. $B[h_i(s)] = 1$ for all $s \in S$ and $i = 1, \dots, k$. Important to not get confused here: we only have one bit array, not multiple.
If we now want to check whether an element $x$ is in $S$, we again apply all $k$ hash functions, but now check whether all the bits $B[h_1(x)], \dots, B[h_k(x)]$ are 1. If even one of them is not 1, then we can say with 100% certainty that $x \notin S$.
The probability of a false positive is approximately $\left(1 - e^{-km/n}\right)^k$, and it is minimized by picking $k = \frac{n}{m} \ln 2$ hash functions.
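A minimal Bloom filter sketch in Python; the $k$ hash functions are simulated here by salting a single MD5 hash, which is only a stand-in for $k$ truly independent hash functions:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter with n bits and k (simulated) hash functions."""

    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits   # bit array B, initialized to 0

    def _positions(self, item):
        # One salted hash per "hash function" h_1, ..., h_k.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1     # set B[h_i(item)] = 1 for every hash function

    def __contains__(self, item):
        # Possibly in S only if *all* k bits are set; a single 0 bit means "definitely not in S".
        return all(self.bits[pos] for pos in self._positions(item))
```

For example, with $m = 10^6$ stored elements, $n = 8 \cdot 10^6$ bits and $k = 6 \approx (n/m)\ln 2$, the false positive rate works out to about $(1 - e^{-6/8})^6 \approx 2\%$.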
Counting Distinct Elements
There are lots of use cases for counting distinct elements; for example, Amazon might be interested in how many distinct products they have sold in the last week. The most obvious approach would be to maintain a set of the elements seen so far, i.e. use a hash table to keep track of all the distinct elements seen so far. But as always, we assume we do not have enough memory to store such a hash table, so instead we want to be able to estimate the count fairly.
Flajolet-Martin algorithm
The Flajolet-Martin algorithm does exactly that. It is quite a simple algorithm: we hash each element that comes into the system, use the output as an index into a bitstring, and set that bit to 1. Important to note is that the mapping to bit positions is no longer uniform but exponential: $1/2$ of the elements map to the first bit, $1/4$ to the second bit, etc.
Our intuition is then the following: the probability that a given element sets the first bit is $1/2$, the second bit $1/4$, and the $i$-th bit $2^{-i}$. This means that if the $i$-th bit is set to 1, then an event with probability $2^{-i}$ has happened, which we would expect to see once we have inserted about $2^i$ distinct elements.
The algorithm then says that if $R$ is the position of the least-significant 0 bit in the bitstring, then the number of distinct elements is approximately $2^R / \phi$, where $\phi \approx 0.77351$ is a constant.
We can further improve the accuracy of this algorithm by using $k$ different hash functions and bitstrings. We then compute the average position of the least-significant 0 bit, $\bar{R} = \frac{1}{k} \sum_{j=1}^{k} R_j$.
The number of distinct elements is then estimated as $2^{\bar{R}} / \phi$, where $\phi$ is still the same constant.
The storage required for this algorithm is also very small: if we use $k$ hash functions and bitstrings and expect up to $N$ distinct elements, each bitstring only needs $O(\log N)$ bits, so we need $O(k \log N)$ bits in total.
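A sketch of the Flajolet-Martin estimate as described above; salted MD5 hashes again stand in for $k$ independent hash functions, and the averaging of the $R$ values follows the description above:

```python
import hashlib

def rho(x, max_pos=63):
    """Position of the least-significant 1 bit of x (capped so it fits the bitmap)."""
    if x == 0:
        return max_pos
    return min((x & -x).bit_length() - 1, max_pos)

def fm_estimate(stream, k=16, phi=0.77351):
    """Flajolet-Martin estimate of the number of distinct elements in a stream."""
    L = 64
    bitmaps = [[0] * L for _ in range(k)]
    for element in stream:
        for j in range(k):
            h = int(hashlib.md5(f"{j}:{element}".encode()).hexdigest(), 16)
            bitmaps[j][rho(h, L - 1)] = 1   # exponentially distributed bit position
    # R_j = position of the least-significant 0 bit in bitmap j.
    Rs = [bm.index(0) if 0 in bm else L for bm in bitmaps]
    R_bar = sum(Rs) / k
    return (2 ** R_bar) / phi

print(fm_estimate(str(i % 1000) for i in range(10_000)))  # should be on the order of 1000
```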
Moments
First, let us define what a moment is and then see what moments can be used for. Suppose we have a data stream $S$ and the set $A$ contains all the distinct values occurring in the stream. We then let $m_i$ be the number of times the value $i$ occurs in the stream $S$. We can then define the $k$-th moment as:

$$\sum_{i \in A} (m_i)^k$$
For example, if we have the stream $S = (a, b, a, c, a, b)$, then $m_a = 3$, $m_b = 2$, $m_c = 1$, and the 2nd moment is $3^2 + 2^2 + 1^2 = 14$.
If we take a closer look at different moments we will notice the following:
- The 0th moment corresponds to the number of distinct elements, just as we saw above.
- The 1st moment corresponds to the number of elements in the stream, i.e. its length $\sum_{i \in A} m_i$.
- The 2nd moment we call the surprise number of $S$, which is a measure of how uneven the distribution of the element counts is (the lower the value, the more even the distribution).
If we have a stream of length 100 with 11 distinct values and the counts $m_i$ are $10, 9, 9, \dots, 9$, then the surprise number is $10^2 + 10 \cdot 9^2 = 910$, because the distribution is pretty even.
If instead the counts are $90, 1, 1, \dots, 1$, then the surprise number is $90^2 + 10 \cdot 1^2 = 8110$.
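A tiny sketch for computing moments with a Python `Counter`, which also reproduces the 2nd-moment example and the two surprise numbers above:

```python
from collections import Counter

def moment(stream, k):
    """k-th moment: sum over all distinct values i of m_i ** k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

print(moment("abacab", 2))                 # 14, the example from above

# The two surprise-number examples (stream length 100, 11 distinct values):
even_counts = [10] + [9] * 10              # counts 10, 9, 9, ..., 9
skewed_counts = [90] + [1] * 10            # counts 90, 1, 1, ..., 1
print(sum(m ** 2 for m in even_counts))    # 910
print(sum(m ** 2 for m in skewed_counts))  # 8110
```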
AMS Method
Now we want a method that gives an unbiased estimate of the $k$-th moment. First, let us look at the 2nd moment.
The idea of the AMS method for the 2nd moment is the following: pick a position $t$ uniformly at random in a stream of length $n$, let $X.el$ be the element at that position and $X.val$ the number of times $X.el$ occurs from position $t$ to the end of the stream. Then $f(X) = n(2 \cdot X.val - 1)$ is an unbiased estimate of the 2nd moment, and averaging many such estimates (for independently chosen positions) reduces the variance.
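A small sketch of this estimator, assuming for simplicity that the whole (fixed-length) stream is available so we can pick positions directly; a true streaming version would choose the positions on the fly:

```python
import random
from collections import Counter

def ams_second_moment(stream, num_vars=100):
    """AMS estimate of the 2nd moment of a fixed-length stream."""
    n = len(stream)
    estimates = []
    for _ in range(num_vars):
        t = random.randrange(n)               # uniformly random position in the stream
        element = stream[t]
        val = stream[t:].count(element)       # occurrences of element from t onwards
        estimates.append(n * (2 * val - 1))   # one unbiased estimate f(X)
    return sum(estimates) / num_vars

# Compare against the exact 2nd moment on a small example.
stream = list("abacab") * 50
exact = sum(m ** 2 for m in Counter(stream).values())
print(exact, ams_second_moment(stream))       # the estimate should be close to the exact value
```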
Counting Frequent Itemsets
We can imagine that Amazon or Google often ask themselves how, given a data stream, they can find recent frequent items, where frequent is defined by a threshold and recent means lying within a sliding window.