Protecting our personal information on the Internet

Google records the words you search for. It keeps information such as the IP addresses from which you reach Google, your geographic location, your computer's operating system and version, the web browser you use, and your screen resolution forever. This is what allows Google to show you personalized ads. None of this is new, of course. But it has a very problematic side. This information can be sold. This information can be stolen. Various sites compare it against the cookies on your computer.

They can identify you with almost pinpoint accuracy at the level of the individual and draw conclusions about you such as “female, in her 30s, has had a fungal infection and stomach problems, probably has a child because she searched for baby food, Chrome user, not a very technical person, etc.” Imagine this information being sold to health insurance companies! What is more, if this person also has a Gmail account, all the emails sent to her are scanned automatically anyway. Supposedly this information is anonymized, but it is very easy for them to work out your internet identity and your online behavior. And in any case, when there is a court order, or when their own country's intelligence services request this information, they are obliged to hand it over.

The upshot? Pay attention to your internet security. As a small start, instead of searching on Google directly, you can use these sites:

https://startpage.com/

https://duckduckgo.com/

These sites do not log where you connect from; Startpage, for example, searches Google for your query on your behalf and brings the results page back to you.

A Distributed Bioinformatics File System

I have recently explored the idea of a Distributed Bioinformatics File System on Eagle Genomics’ blog.

In that article, I discussed GlusterFS, a distributed file system, and described my earlier experience of installing it on Amazon Web Services for a bioinformatics project.

Finally, I listed two features that a future file system, a Distributed Bioinformatics File System (DBFS), should ideally have: data deduplication and delta encoding.

I have received interesting and encouraging comments from one of the GlusterFS developers, who told us that they have in fact already been experimenting with integrating these features!

Read the full article here.

Are encrypted Skype calls secure enough?

Is Skype really that secure? We know that it uses 256-bit AES encryption to protect communication between users. But does that mean Skype is incapable of eavesdropping on calls or chat messages? This web site suggests that some kind of text filtering is applied to Skype communication in China.

It is also no secret that Skype, after being bought by eBay and before being acquired by Microsoft, provided user information to the US government. Today, however, I am going to look at whether we can extract any information from an encrypted Skype call using only the sent and received network packets, without attempting any decryption.

Setup

In this experiment, Skype version 4.0.0.224 on a Windows XP machine (the caller side) was used to make a voice call to another Skype client (version “2.2 Beta for Linux”) running on an Ubuntu machine (the receiver side). The two machines were co-located on the same network. Packets were captured on both sides with Wireshark, a popular open-source network protocol analyzer available for both Linux and Windows.

On the caller side, a 158-second audio file was played directly into the audio input device; that is, no microphone or recording equipment was used. The audio file (8000 Hz, mono, 16-bit PCM WAV) is a phone conversation between a female and a male speaker of American English.

Data analyses

Captured packets were filtered to eliminate all non-Skype traffic, both incoming and outgoing: all non-UDP packets (TCP, ARP, etc.) were ignored, and only packets sent from the caller to the target Skype port were considered.
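The filtering step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Packet` record, its field names, the IP addresses and the port number are all hypothetical, and the sketch says nothing about how the records were actually exported from Wireshark.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    # Hypothetical record layout, not a Wireshark export format.
    time: float     # capture time stamp in seconds
    proto: str      # e.g. "UDP", "TCP", "ARP"
    src: str        # source IP address
    dst_port: int   # destination port (0 where not applicable)
    size: int       # packet size in bytes

def filter_call_packets(packets, caller_ip, skype_port):
    """Keep only UDP packets sent by the caller to the target Skype port."""
    return [p for p in packets
            if p.proto == "UDP"
            and p.src == caller_ip
            and p.dst_port == skype_port]

# Made-up capture for illustration only.
capture = [
    Packet(0.000, "UDP", "192.168.1.10", 40000, 140),
    Packet(0.004, "TCP", "192.168.1.10", 443, 66),
    Packet(0.011, "ARP", "192.168.1.1", 0, 42),
    Packet(0.020, "UDP", "192.168.1.20", 51000, 137),  # reply from the receiver
    Packet(0.021, "UDP", "192.168.1.10", 40000, 137),
]

call = filter_call_packets(capture, "192.168.1.10", 40000)
for p in call:
    print(p.time, p.size)
```

Only the time stamp and the size of each surviving packet are needed for the rest of the analysis.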

Packet time stamps and packet sizes were parsed from the filtered logs on both sides. A comparison of the two log files revealed that the packet sizes were consistent on both sides, but that the time-sent and time-arrived values differed.

During the Skype calls, a packet was sent every 0.02 seconds on average. The average packet size was 135.7 bytes, of which 42 bytes are a fixed header section and the remainder is Skype’s encrypted data.

I repeated the test several times. While packet sizes were consistent between the caller and receiver sides within each attempt, I observed some discrepancies in packet sizes from one Skype call to the next, despite using the same audio file and setup. The table below contains two sets of packet sizes corresponding to the same audio segment, taken from two separate Skype calls in which the same file was sent and received. To produce this table, I had to shift the data to find the exact position where the two data sets correspond to each other.
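As a quick back-of-the-envelope check, the averages above imply a packet rate and an encrypted payload rate. The snippet below only restates the three figures from the text; the derived numbers are averages and the variable names are mine.

```python
avg_interval_s = 0.02      # from the text: one packet every 0.02 s on average
avg_packet_bytes = 135.7   # from the text: average packet size
header_bytes = 42          # from the text: fixed header section

packets_per_second = 1 / avg_interval_s
payload_bytes = avg_packet_bytes - header_bytes
payload_kbit_per_s = payload_bytes * 8 * packets_per_second / 1000

print(f"{packets_per_second:.0f} packets/s")
print(f"{payload_bytes:.1f} payload bytes per packet")
print(f"{payload_kbit_per_s:.1f} kbit/s of encrypted payload")
```

That is roughly 50 packets per second, each carrying about 93.7 bytes of encrypted data, or about 37.5 kbit/s of payload on average.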

Call 1    Call 2
140       137
139       136
136       140
129       123
132       119
124       97
96        91
93        94
98        104
107       99
107       120
121       120
121       135

Table 1: UDP packet sizes comparison for the same audio region in two separate Skype calls.
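The shifting needed to line up two calls can be automated. The sketch below is one simple way to do it, not necessarily how the table above was produced: it tries every shift within a range and keeps the one with the smallest mean absolute difference in packet size. The function name, the `min_overlap` guard and the example numbers are all made up for illustration.

```python
def best_shift(a, b, max_shift, min_overlap=5):
    """Return (shift, cost) where a[i] is paired with b[i + shift] and
    cost is the mean absolute size difference over the overlapping part."""
    best = None
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(a[i], b[i + shift]) for i in range(len(a))
                 if 0 <= i + shift < len(b)]
        if len(pairs) < min_overlap:
            continue  # too little overlap to be meaningful
        cost = sum(abs(x - y) for x, y in pairs) / len(pairs)
        if best is None or cost < best[1]:
            best = (shift, cost)
    return best

# Hypothetical packet sizes: call_b starts three packets earlier than call_a
# and differs by a byte or two per packet.
call_a = [120, 131, 140, 128, 110, 95, 92, 101, 118, 133]
call_b = [88, 90, 85, 122, 129, 141, 126, 112, 97, 90, 103, 117, 135]

shift, cost = best_shift(call_a, call_b, max_shift=5)
print(shift, cost)
```

The mean absolute difference is used here because it is easy to read in bytes; a squared-error cost would work just as well.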

As mentioned above, packets are not sent at regular intervals, so packet sizes alone give a rather noisy representation of the actual audio waveform (data not shown). Instead, we can approximately match the audio waveform to the sent packets by computing

$$d_i = \frac{s_i}{t_i - t_{i-1}} \qquad \text{(Eq 1)}$$

where $s_i$ is the size of the $i$th packet and $t_i$ is its time stamp. This gives the plot below, a corrected plot showing the amount of data sent per unit time:

Figure 1: The bottom panel shows the actual audio waveform, and the top panel shows the $d_i$ values obtained from Equation 1.

As can be seen in Figure 1, the plotted $d_i$ values roughly follow the general shape of the waveform. Although this representation is better than plotting packet sizes along a uniform time axis (where time intervals are fixed), it is still very noisy.
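Equation 1 is straightforward to compute from the parsed time stamps and sizes. A minimal sketch follows; the example numbers are invented, and skipping non-positive time differences is my own guard, not something described in the text.

```python
def rate_per_packet(times, sizes):
    """Eq 1: size of packet i divided by the time since packet i-1, for i >= 1."""
    d = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            continue  # assumption: ignore duplicated or out-of-order time stamps
        d.append(sizes[i] / dt)
    return d

# Made-up time stamps (seconds) and packet sizes (bytes).
times = [0.000, 0.020, 0.045, 0.060]
sizes = [140, 137, 139, 136]

d_values = rate_per_packet(times, sizes)
print([round(v) for v in d_values])
```

The values are in bytes per second, and the first packet has no predecessor, so the result is one element shorter than the input.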

Figure 2 shows how the time stamps of consecutive packets vary. The y-axis represents the time interval between successive packets, while the x-axis corresponds to the packet number. An interesting observation from this plot is that, although the time intervals differ, they appear to take only a fixed number of preferred values rather than arbitrary values from a continuous range. That is, the delay can be 0.015, 0.02, or 0.025 seconds, and so on.

Figure 2: Time differences (y-axis) in seconds between consecutive packets (x-axis) sent, for a small audio segment.
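One way to check for such preferred values is to snap every interval to a grid and count how often each grid value occurs. The sketch below assumes a 5 ms grid, which I inferred from the example values 0.015, 0.02 and 0.025 s; the time stamps are made up.

```python
from collections import Counter

def interval_histogram(times, step_ms=5):
    """Count inter-packet intervals after snapping each one to the nearest
    multiple of step_ms. Keys are intervals in whole milliseconds."""
    counts = Counter()
    for prev, cur in zip(times, times[1:]):
        interval_ms = (cur - prev) * 1000
        counts[round(interval_ms / step_ms) * step_ms] += 1
    return counts

# Made-up time stamps in seconds, with a little measurement jitter.
times = [0.0, 0.0201, 0.0352, 0.0601, 0.0799, 0.1001]

hist = interval_histogram(times)
for interval_ms, n in sorted(hist.items()):
    print(interval_ms, "ms:", n)
```

If the intervals really cluster on a few values, almost all of the mass falls into a handful of bins and the raw intervals sit close to the bin centres.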

Next, I created “time interval, packet size” tuples for two Skype calls in which I sent the same audio file. Based on an analysis of the entire audio, I allowed the time difference to take any of 8 pre-determined values (Figure 2 shows roughly 6 conserved values, but I believe there were 8 when looking at a wider range of data). The question: will these delay/size patterns be the same for the two calls? To find the best matching tuple positions, I computed the total Euclidean distance between windows of 200 tuples from the two calls. The best matching positions were not sequential, i.e., the preferred time intervals over the same regions of the two calls were not consistent. In other words, the observed packet time intervals are not reproducible across calls carrying the same audio, and thus they are independent of the data. But as we saw in Table 1, the packet sizes of the same audio segment are broadly similar, albeit roughly, across two separate Skype calls.
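The window comparison can be sketched as follows. The text does not say how the two tuple components were scaled, so this sketch assumes intervals in milliseconds and sizes in bytes; the function names, the tiny window length and the data are hypothetical (the analysis above used windows of 200 tuples).

```python
import math

def window_distance(a, b):
    """Euclidean distance between two equally long lists of (interval, size) tuples."""
    return math.sqrt(sum((ia - ib) ** 2 + (sa - sb) ** 2
                         for (ia, sa), (ib, sb) in zip(a, b)))

def best_match_positions(call_a, call_b, window, stride):
    """For each window of call_a, taken every `stride` tuples, find the start
    index of the closest window in call_b."""
    matches = []
    for start_a in range(0, len(call_a) - window + 1, stride):
        wa = call_a[start_a:start_a + window]
        start_b = min(range(len(call_b) - window + 1),
                      key=lambda j: window_distance(wa, call_b[j:j + window]))
        matches.append((start_a, start_b))
    return matches

# Made-up (interval in ms, size in bytes) tuples; call_b is call_a delayed by two packets.
call_a = [(20, 140), (15, 137), (25, 139), (20, 120),
          (20, 98), (15, 91), (25, 104), (20, 121)]
call_b = [(20, 60), (20, 61)] + call_a

matches = best_match_positions(call_a, call_b, window=3, stride=2)
offsets = {start_b - start_a for start_a, start_b in matches}
print(matches, offsets)
```

If the two calls lined up consistently, every window would report the same offset, as in this toy example; best matches scattered over many different offsets are what "not sequential" means above.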

In the next step, I used a longer time window to compute the d values. Even though the time between any two consecutive packets varies, these fluctuations should be negligible over a sufficiently large time span. Instead of working with only two neighboring packets, this time I used groups of 20 packets (denoted by k below) to work out the corrected d values (Eq 2), which gave me the plot in Figure 3b.

$$d_i = \frac{s_{i+k}}{t_{i+k} - t_{i-k}} \qquad \text{(Eq 2)}$$

Figure 3: a) The bottom pane shows the original waveform, b) the middle pane shows the calculated d values over k=20 from a Skype call, c) the top pane is the power plot of the original audio waveform.
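A windowed rate in the spirit of Equation 2 can be sketched as below. Note one assumption that goes beyond the equation as printed: the numerator here is the total number of bytes sent within the window, whereas the printed equation shows a single size term. The time span in the denominator, from packet i-k to packet i+k, follows the equation. The example data are made up.

```python
def windowed_rate(times, sizes, k):
    """Bytes sent after packet i-k up to and including packet i+k, divided by
    the elapsed time t[i+k] - t[i-k], for every i where the window fits."""
    d = []
    for i in range(k, len(times) - k):
        span = times[i + k] - times[i - k]
        if span <= 0:
            continue  # assumption: skip degenerate windows
        total = sum(sizes[i - k + 1:i + k + 1])  # summation is an assumption
        d.append(total / span)
    return d

# Made-up data: constant 100-byte packets with jittery send times (seconds).
times = [0.000, 0.015, 0.040, 0.060, 0.085, 0.100]
sizes = [100] * 6

per_packet = [sizes[i] / (times[i] - times[i - 1]) for i in range(1, len(times))]
smoothed = windowed_rate(times, sizes, k=2)
print([round(v) for v in per_packet])
print([round(v) for v in smoothed])
```

With constant packet sizes the per-packet values swing widely purely because of timing jitter, while the windowed values stay close to the true rate of 5000 bytes per second.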

Notice how Equation 2 echoes the power plot of the original audio (Figure 3)! It seems that, even though Skype communication is encrypted, it may still be possible, for example, to search for certain phrases by comparing the energy pattern of a phrase with the packet size data mapped through this equation. Another application of packet data could be speaker diarisation. The size of the encrypted audio data is closely related to how well each speaker's audio compresses. If two speakers have distinct voice characteristics, the corresponding Skype packet sizes could hint at who is speaking when, following the methodology explained here.
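To put a number on how closely the d values follow the audio energy, one could compute a frame-by-frame power of the audio and correlate it with the d values. This is only a sketch of the idea and was not part of the experiment: the frame length, the toy samples and the d values below are invented, and a real comparison would first need the two series resampled to the same time base.

```python
import math

def frame_power(samples, frame_len):
    """Mean squared amplitude of consecutive, non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def pearson(x, y):
    """Pearson correlation of two series, truncated to the shorter length."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Toy audio: a quiet frame, a loud frame, then a moderately loud frame.
samples = [1, -1] * 4 + [10, -10] * 4 + [3, -3] * 4
power = frame_power(samples, frame_len=8)

# Hypothetical d values (bytes per second) for the same three frames.
d_values = [4500.0, 6800.0, 5200.0]

r = pearson(power, d_values)
print(power, round(r, 2))
```

A phrase search along these lines would slide the energy pattern of the phrase along the d series and look for positions where the correlation peaks.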