AMD&BD: Assignment 1 part II
By Kasper Bouwens, Alexander van Someren, Casper Thuis and Wolf Vos
Wifi tracking at KPMG-restaurant
Introduction
The following part documents the analysis of the WiFi data collected at the KPMG restaurant. Eleven routers logged the signal strengths of WiFi packages sent by mobile devices. We first identify the individual packages, then estimate the position of the sending device for each package by chi-squared minimization of a modified Friis equation, assess the uncertainty of those positions, and finally combine packages that were sent shortly after each other.
Setup
Reading in the data
In this part, we read the data into a pandas DataFrame and define the router positions provided by KPMG. Furthermore, we add a DateTime column to make the measured times more readable. Finally, we add a temp_id field that is simply the concatenation of typeNr, subTypeNr and seqNr.
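A minimal sketch of this setup step; the toy data stands in for the provided file, and the exact column dtypes are our assumptions:

```python
import pandas as pd

# Toy stand-in for the KPMG data; in the notebook this comes from the provided file.
df = pd.DataFrame({
    "measurementTimestamp": [1457960000000, 1457960000123],
    "typeNr": [0, 0],
    "subTypeNr": [4, 4],
    "seqNr": [17, 18],
})

# Make the measured times human-readable.
df["DateTime"] = pd.to_datetime(df["measurementTimestamp"], unit="ms")

# Temporary package id: concatenation of typeNr, subTypeNr and seqNr.
df["temp_id"] = (
    df["typeNr"].astype(str) + "_"
    + df["subTypeNr"].astype(str) + "_"
    + df["seqNr"].astype(str)
)
```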
Identifying separate packages
The temp_id column explained (and declared) above is not the final package id, because the seqNr variable repeats itself. Therefore, the task of identifying the separate packages is still at hand. We can use the measurementTimestamp for this, assuming that the measurement times belonging to a single package should not be too far apart.
The question now, however, is what "too far apart" means. Therefore, we plot the number of different packages against the maximum allowed time difference, from 0 ms to 5000 ms.
We clearly see that from zero to 250 ms the number of identified packages decreases drastically and stabilizes from 500 ms onwards. From roughly 800 ms the number stays stable until 3.5 seconds. This could mean that this range covers the maximum time difference caused by router hardware issues, making it a good value to use for identifying packages. It is good to realize that every value from 500 ms and higher will only cause very few packages to merge, so most packages will remain the same.
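The splitting step could be sketched as follows (a minimal version, assuming the columns introduced above; the 1000 ms default reflects the threshold discussed in the text):

```python
import pandas as pd

def assign_package_ids(df, max_gap_ms=1000):
    """Split rows sharing a temp_id into separate packages whenever
    consecutive measurement times are more than max_gap_ms apart."""
    df = df.sort_values(["temp_id", "measurementTimestamp"]).copy()
    # True where a new package starts within the same temp_id.
    gap = (df.groupby("temp_id")["measurementTimestamp"].diff() > max_gap_ms).astype(int)
    # Running count of gaps per temp_id distinguishes the repeats of a seqNr.
    df["package_id"] = df["temp_id"] + "_" + gap.groupby(df["temp_id"]).cumsum().astype(str)
    return df

toy = pd.DataFrame({"temp_id": ["a", "a", "a", "b"],
                    "measurementTimestamp": [0, 100, 5000, 0]})
out = assign_package_ids(toy)
```

The two rows of `temp_id` "a" within 100 ms of each other stay together, while the one 4.9 s later becomes a new package.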
Estimation of device position and uncertainty
In this section we will use the above identified packages to estimate the device position for each package. To do this we will minimize
$$\chi^2(x, y) = \sum_i \frac{\left(P_m^i - P_r^i(x, y)\right)^2}{\sigma^2},$$
where $P_m^i$ is the measured signal strength, we sum over the routers $i$, and
$$P_r^i(x,y) = P_t + 20 \log_{10}\left(\frac{c}{4\pi f}\right) - 10 \log_{10}\left((x - x_i)^2 + (y - y_i)^2 + Z^2\right)$$
is a modified version of the Friis free space equation that gives the expected signal strength for a given router $i$ and location $(x, y)$. For this case, the height $Z = 2$ m, since all the routers are 3 m above the ground and all smartphones are assumed to be 1 m above the ground.
The above translates in code as follows:
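A sketch of such a translation (the 2.4 GHz carrier frequency and the exact function signature are our assumptions; the formula itself follows the equation above):

```python
import numpy as np

C = 299_792_458.0   # speed of light c (m/s)
F = 2.4e9           # WiFi carrier frequency f (Hz), assumed 2.4 GHz band
Z = 2.0             # height difference: routers at 3 m, phones at 1 m

def expected_power(x, y, xi, yi, Pt):
    """Expected signal strength at router (xi, yi) for a device at (x, y),
    following the modified Friis equation above."""
    d_sq = (x - xi) ** 2 + (y - yi) ** 2 + Z ** 2
    return Pt + 20 * np.log10(C / (4 * np.pi * F)) - 10 * np.log10(d_sq)
```

The expected power drops as the device moves away from the router, as the Friis equation dictates.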
The next step is to minimize the function to obtain the optimal device positions. This is done in the calculate_positions
method below.
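A minimal sketch of such a method, using `scipy.optimize.minimize`; the signature and the choice of Nelder-Mead are our assumptions:

```python
import numpy as np
from scipy.optimize import minimize

C, F, Z = 299_792_458.0, 2.4e9, 2.0  # speed of light, assumed 2.4 GHz band, height difference

def expected_power(x, y, xi, yi, Pt):
    """Modified Friis equation: expected signal strength at router (xi, yi)."""
    d_sq = (x - xi) ** 2 + (y - yi) ** 2 + Z ** 2
    return Pt + 20 * np.log10(C / (4 * np.pi * F)) - 10 * np.log10(d_sq)

def chi_squared(pos, routers, measured, Pt, sigma):
    """Sum of squared, noise-normalized differences between measured and expected power."""
    expected = np.array([expected_power(pos[0], pos[1], xi, yi, Pt) for xi, yi in routers])
    return np.sum((measured - expected) ** 2) / sigma ** 2

def calculate_position(routers, measured, Pt, sigma, initial_guess):
    """Minimize the chi-squared to obtain the most likely device position."""
    result = minimize(chi_squared, initial_guess,
                      args=(routers, measured, Pt, sigma), method="Nelder-Mead")
    return result.x
```

With noiseless synthetic measurements, the minimizer recovers the true position, which is a useful sanity check before running on real data.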
Parameters for the optimization procedure
To calculate the positions, we need to provide two extra parameters: the expected sigma (or noise) in our system and an initial guess for the coordinates, which helps the minimization algorithm.
Expected noise
One way to assess the amount of noise in the system is by choosing $\sigma$ in such a way that the $\chi^2$ value matches the degrees of freedom in the system.
In an experiment like this, we expect that the average value of $\chi^2$ is equal to the degrees of freedom in the system. In the best case, we have 11 routers (or data points) for a package and we have 3 parameters to estimate, which results in $11 - 3 = 8$ degrees of freedom. However, as we have seen above, most packages are not measured by all routers. Therefore, one could argue that we should define the degrees of freedom separately for each number of measurements per package (e.g. packages that were received by all routers get 8 DoF, packages that were received by 10 routers get 7 DoF). An advantage of this approach would be that we obtain a different expected noise for each number of measurements, which is in agreement with the idea that we expect less noise when the package is measured by more routers.
Sadly, this approach also has a major disadvantage: it works better with more data, and will probably work poorly for the (small) dataset at hand. Imagine, for example, that there are only two packages that were measured by 10 routers, of which one was measured with a great amount of noise; then matching the $\chi^2$ value to the number of degrees of freedom will yield a $\sigma$ that is too high.
Therefore we could try a different approach: combine all the measurements, estimate one $\sigma$ value, and simply set it to match the average degrees of freedom in the system. Let us first look at the distribution of the number of measurements per package.
As can be seen, the average number of measurements per package, 3.7, is quite low. The histogram shows that this is due to the large number of packages with only one or two measurements. Two measurements are obviously not enough to perform triangulation, so we remove them from our further analysis.
Furthermore, we choose the largest bin (10 measurements) to estimate our sigma. With 10 measurements, we have $10 - 3 = 7$ DoF, so we will look for a $\chi^2$ value of 7.
We obtain a $\sigma$ of 8.5.
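Since the $\chi^2$ scales as $1/\sigma^2$ (and the best-fit positions do not depend on a constant $\sigma$), this matching can be done in closed form: compute each package's residuals once and rescale. A sketch (the function name is ours):

```python
import numpy as np

def match_sigma(residuals_per_package, dof):
    """Choose sigma so the average chi-squared equals the degrees of freedom.

    residuals_per_package: list of arrays of (measured - expected) per package,
    evaluated at the best-fit positions. Since chi^2 = sum(r^2) / sigma^2,
    setting mean(chi^2) = dof gives sigma = sqrt(mean(sum(r^2)) / dof)."""
    mean_unscaled = np.mean([np.sum(np.asarray(r) ** 2) for r in residuals_per_package])
    return np.sqrt(mean_unscaled / dof)
```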
Initial guess
The other parameter we need to choose for the optimization algorithm is the initial guess. In a real-life situation, you could use a clever heuristic like the previous location of the phone, but a simpler, naive guess is the average of the router positions, i.e. the center of the routers.
Results
We are now ready to calculate the device positions with an expected error of 8.5 and an initial guess that is the center of the routers:
First round
Now, we run the actual experiment.
We plot a histogram of the $\chi^2$ values and compare them with the $\chi^2$ distribution for 7 DoF. Since we use the values of the entire dataset and, as we have seen above, 7 DoF does not hold for every data point, we should not expect a perfect fit.
As expected, the fit is not perfect. However, the histogram does appear to follow some $\chi^2$ distribution, or maybe a combination of $\chi^2$ distributions.
To better assess the quality of the model, and more specifically the uncertainty of our system, we make a histogram of the normalized residuals (also called a pull distribution). This distribution should be a normal distribution with a mean of zero (that is how we optimized) and a standard deviation of one (because we divide by the expected noise).
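The pull computation itself is a one-liner (a sketch, assuming arrays of measured and expected powers at the fitted positions):

```python
import numpy as np

def pulls(measured, expected, sigma):
    """Normalized residuals; approximately N(0, 1) when sigma matches the true noise."""
    return (np.asarray(measured) - np.asarray(expected)) / sigma

# Synthetic check: with noise of width 3 and sigma = 3, mean ~ 0 and std ~ 1.
rng = np.random.default_rng(0)
expected = np.zeros(10_000)
measured = expected + rng.normal(0, 3, expected.size)
p = pulls(measured, expected, 3.0)
```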
We see that the distribution is nicely centered around zero. However, the standard deviation is lower than one, so despite the earlier attempt to match the $\chi^2$ value to the degrees of freedom in the system, a $\sigma$ of 8.5 seems to be a slight overestimation of the noise in the system.
Second round
By experimenting, we found that a $\sigma$ of 6 provides a pull distribution with a standard deviation of ~1.
The choice of $\sigma = 6$ seems to reflect the amount of noise in the system well. Therefore we will use these results to plot the positions.
Estimating position uncertainty
In the previous section, we tried to find the device positions that minimize the $\chi^2$ function given an expected signal noise. However, we are not only interested in the most likely positions, but also want to assess the spatial uncertainty of our estimation.
To do this, we rederive the B matrix. The derivation is quite long and is therefore not included; the result, however, is shown below:
$$B = \frac{1}{2}\begin{pmatrix} \dfrac{\partial^2 \chi^2}{\partial x^2} & \dfrac{\partial^2 \chi^2}{\partial x\,\partial y} \\ \dfrac{\partial^2 \chi^2}{\partial y\,\partial x} & \dfrac{\partial^2 \chi^2}{\partial y^2} \end{pmatrix}$$
Inverting the B matrix yields the variances in x and y, which are the first and second diagonal elements of $B^{-1}$ respectively.
System resolution
Average transmission power
The estimated transmission powers look reasonable: they are never higher than one, and they are quite a bit higher than the measured signal strengths. The latter is expected, since the signal weakens with distance according to the Friis equation.
Combining nearby packages
As the assignment mentions:
Some wifi packets were sent so shortly after each other that the device could not have moved very far if at all.
The suggested approach is to "compose another chi-squared method to combine the locations of wifi packets that are closely spaced in time into one location". This time, the chi-squared minimization will not minimize the sum of squared differences in signal strengths, but the sum of squared Euclidean distances between the optimal combined position and the earlier positions. This translates to minimizing:
$$\chi^2(\vec{r}) = \sum_j \left(\frac{(x - x_j)^2}{\sigma_{x,j}^2} + \frac{(y - y_j)^2}{\sigma_{y,j}^2}\right),$$
where we sum over the packages $j$ that are combined into the current package,
$$\vec{r} = (x, y)$$
is the vector of the x and y location of the combined measurement,
$$\vec{r}_j = (x_j, y_j)$$
is the vector of the previously estimated x and y location of the original package, and
$$\vec{\sigma}_j = (\sigma_{x,j}, \sigma_{y,j})$$
is the vector of the previously estimated uncertainties of that package.
The first thing to do is identify packages that are close to each other in time. We use a time threshold of 1000 ms. This choice is not arbitrary: it corresponds to the choice made earlier when separating packages (we chose a value slightly past the bending point in the plot of the number of unique packages as a function of the maximum allowed time difference).
We see that some of the bigger packages "merged", as we now have two packages with over 35 measurements. The next step is implementing the chi-squared function we defined above and calculating the optimal combined coordinates.
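For this particular chi-squared the minimum can even be written in closed form: setting the derivatives with respect to x and y to zero gives the inverse-variance weighted average per coordinate. A sketch (the names are ours):

```python
import numpy as np

def combine_positions(xs, ys, sigma_x, sigma_y):
    """Minimize sum_j ((x - x_j)^2 / sx_j^2 + (y - y_j)^2 / sy_j^2).

    The minimum is the inverse-variance weighted mean of the positions."""
    wx = 1.0 / np.asarray(sigma_x) ** 2
    wy = 1.0 / np.asarray(sigma_y) ** 2
    x = np.sum(wx * np.asarray(xs)) / np.sum(wx)
    y = np.sum(wy * np.asarray(ys)) / np.sum(wy)
    return x, y
```

Positions with small uncertainties pull the combined location toward themselves, which is exactly the behavior the chi-squared formulation asks for.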
To say something about the usefulness of this method, we also printed the average of the locations of the packages that are combined, and the difference between that simple method and the chi-squared minimization method presented above. The largest difference was 0.7 meters, which is not that much if you consider the uncertainties of the original locations.
Concluding remarks
To summarize, we used packages received by 11 routers at the KPMG restaurant to estimate the positions of a mobile device. To do this, we used the Friis equation in combination with chi-squared minimization. The system resolution was 1.9 meters, which is quite coarse compared to, for example, GPS. It would be nice to have annotated data (e.g. obtained using a motion capture system) to better assess the noise in the system and to look for patterns in this noise, so the resolution of the system could be improved.
We did not get to implementing the noise assessment for the combined packages, but would like to share our thoughts on this issue. The noise could be estimated with a method similar to that of the previous exercise: using a Taylor expansion and obtaining the covariance matrix.
A final remark: before we created a chi-squared method for the positions, we simply combined the packages and performed the analysis in the same manner as our main analysis. It would be interesting to compare that approach to the one suggested above. For a real-time application, that approach would be preferable, since it requires fewer computations.