Numpy histogram density does not sum to 1

During a Computational Vision lab, while comparing histograms, I stumbled upon a peculiar behavior. The pairwise kernel matrix of the histograms – which is just a fancy name for the matrix holding the correlation of each histogram with every other – did not have ones on its diagonal. This means that a histogram was not fully correlated with itself, which is weird.

The comparison metric I was using is the simple histogram intersection, defined as

    \[K_{hi} = \sum_{m=1}^{d} \min(x_m, y_m)\]
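A minimal reproduction of the behaviour (the data and bin count here are made up): with density=True, NumPy normalises the histogram so that its integral is 1, not the sum of its bar heights. The self-intersection K_hi(x, x) is just the sum of the heights, so it is generally not 1 either unless the bins have unit width.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# density=True normalises so the *integral* is 1, not the sum of heights
h, edges = np.histogram(data, bins=20, density=True)
widths = np.diff(edges)

print(h.sum())              # generally not 1
print((h * widths).sum())   # the integral: this is 1

# Self-intersection K(x, x) = sum_m min(h_m, h_m) = sum of heights
k_self = np.minimum(h, h).sum()
print(k_self == h.sum())    # True, so K(x, x) != 1 in general
```

Normalising each histogram by its own sum (instead of relying on density=True) makes the self-intersection exactly 1.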

Continue reading

The Distributional Hypothesis: semantic models in theory and practice

This was the final project for the Data Semantics course at university – A report on distributional semantics and Latent Semantic Analysis.

Here is the nicely-formatted pdf version (with references).

What is the Distributional Hypothesis

When it comes to Distributional Semantics and the Distributional Hypothesis, the slogan is often “You shall know a word by the company it keeps” (J.R. Firth).

The idea of the Distributional Hypothesis is that the distribution of words in a text is related to their meanings. More specifically, the more semantically similar two words are, the more they tend to show up in similar contexts and with similar distributions. Stating the idea the other way round may be helpful: given two morphemes with different meanings, their distributions are likely to differ.

For example, fire and dog are two words unrelated in their meaning, and in fact they are not often used in the same sentence. On the other hand, the words dog and cat are sometimes seen together, so they may share some aspect of meaning.

Mimicking the way children learn, Distributional Semantics relies on huge text corpora, whose parsing makes it possible to gather enough information about word distributions to draw inferences. These corpora are processed with statistical analysis techniques and linear algebra methods to extract information. This is similar to the way humans learn to use words: by seeing how they are used (i.e. by coming across several examples in which a specific word appears).
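The idea can be sketched with a toy corpus (the sentences, the sentence-level context window and the cosine similarity measure are all made up for illustration): each word is represented by its co-occurrence counts, and semantically related words end up with more similar vectors.

```python
import numpy as np
from itertools import combinations

corpus = [
    "the dog chased the cat",
    "the cat saw the dog",
    "the fire burned the house",
]

vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))

# count co-occurrences of word pairs within each sentence
for s in corpus:
    for a, b in combinations(s.split(), 2):
        M[idx[a], idx[b]] += 1
        M[idx[b], idx[a]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "dog" and "cat" share contexts; "dog" and "fire" mostly do not
print(cosine(M[idx["dog"]], M[idx["cat"]]))
print(cosine(M[idx["dog"]], M[idx["fire"]]))
```

Even on three sentences the dog/cat similarity comes out higher than dog/fire; real distributional models do the same thing at scale, with weighting schemes and dimensionality reduction on top.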

The fundamental difference between human learning and what a distributional semantic algorithm can achieve is that humans have concrete, practical experience to rely on. This allows them not only to learn the usage of a word, but eventually to understand its meaning. However, the way word meaning is inferred is still an open research problem in psychology and cognitive science.

Continue reading

Brute force a crackme file password with Python

For a challenge, I had to reverse a file with MD5 hash 85c9feed0cb0f240a62b1e50d1ab0419.

The challenge was called mio cuggino ("my cousin" in Italian), purposefully misspelled with a double g. It asks for three numbers. The challenge led me to brute-force the password with a Python script, learning along the way how to interact with a subprocess's stdin and stdout (SKIP to the next section if you don't care about the context and only want the code).

Looking at the assembly with Radare, the first thing the binary does is check that the numbers are non-negative and in increasing order. In detail, it checks that:

  1. exactly three inputs have been provided;
  2. the first two are non-negative;
  3. the third is bigger than the second;
  4. the second is bigger than the first;
  5. the third is non-negative.

Very good: the input pattern is three non-negative integers in increasing order. No clue yet about what those numbers should be, though.

Scroll the assembly just enough to unravel the magic.

A (pointer to) string is loaded into ebx, which contains the following Italian sentence:

Mi ha detto mio cuggino che una volta e’ stato co’ una che poi gli ha scritto sullo specchio benvenuto nell’AIDS, mio cuggino mio cuggino

The assembly basically takes the characters of the string at the positions given by the first and second inputs (for example, 0 as the first input would map to the first character, M) and checks whether they are equal. If they are not, a Nope message is shown and the binary returns.

If they are equal, the same check is repeated with the third input (against the first one, although this doesn't matter). If this is satisfied as well, a tricky sub.puts_640 function is called (with 5 arguments), and a Uhm message is shown.

Looking into that routine is absolutely useless, as it's completely unreadable and even makes a bunch of additional calls that are further jumbled.
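Putting the pieces together, the brute force can be sketched as follows. This is a sketch reconstructed from the description above: the binary name mio_cuggino, the input format (three space-separated numbers on one line) and the Nope output string are all assumptions.

```python
import subprocess
from itertools import combinations
from pathlib import Path

# The string found in the binary (quoted in the post):
SENTENCE = ("Mi ha detto mio cuggino che una volta e' stato co' una che poi "
            "gli ha scritto sullo specchio benvenuto nell'AIDS, "
            "mio cuggino mio cuggino")

def matching_triples(s):
    """All index triples i < j < k whose characters in s are all equal."""
    return [(i, j, k) for i, j, k in combinations(range(len(s)), 3)
            if s[i] == s[j] == s[k]]

def ask_binary(binary, i, j, k):
    """Feed the three numbers to the crackme's stdin and return its stdout."""
    proc = subprocess.run([binary], input=f"{i} {j} {k}\n",
                          capture_output=True, text=True)
    return proc.stdout

# Binary name is an assumption; only run the loop if it is actually there.
if Path("./mio_cuggino").exists():
    for i, j, k in matching_triples(SENTENCE):
        out = ask_binary("./mio_cuggino", i, j, k)
        if "Nope" not in out:
            print("candidate:", i, j, k, out)
            break
```

Pre-filtering the triples against the string before touching the subprocess keeps the search space tiny compared to blindly trying every combination of three integers.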

Continue reading

The one time pad and the many time pad vulnerability

The aim of this article is to present the one time pad cipher and its biggest vulnerability: the many time pad.

The one time pad: what it is and how it works

The one time pad is the archetype of the stream cipher. It's very simple: if you want to make a message unintelligible to an eavesdropper, just change each character of the original message in a way that you can revert, but that looks random to anyone else.

The one time pad works as follows. Suppose \mathcal{M} is the clear-text message you would like to send securely, of length |\mathcal{M}| = s. First, you generate a random string \mathcal{K} (the pad) of equal length |\mathcal{K}| = s. Then, you obtain the cipher-text version of your message by computing the bitwise XOR of the two strings:

    \[\mathcal{M} \oplus \mathcal{K}\]

The best thing is that decoding is just the same as encoding, as the XOR operator has the property that \mathcal{X} \oplus \mathcal{X} = 0 \ \forall \mathcal{X} (and that \mathcal{X} \oplus 0 = \mathcal{X} \ \forall \mathcal{X}). The only difference is that the cipher-text is involved in the XOR, rather than the clear-text:

    \[\mathcal{C} \oplus \mathcal{K} = \mathcal{M} \oplus \mathcal{K} \oplus \mathcal{K} = \mathcal{M} \oplus 0 = \mathcal{M}\]

Below is an example of the one time pad encoding achieved with Python, with a made-up pad string.
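The snippet itself is embedded in the original post; a hypothetical reconstruction along the same lines (both the message and the pad below are made up) could look like this:

```python
message = b"attack at dawn"
one_time_pad = b"XKeyboardKeys!"  # made-up pad, same length as the message

# Encode: XOR each message byte with the corresponding pad byte
result = bytes(m ^ k for m, k in zip(message, one_time_pad))
print(result)

# Decode: XOR the cipher-text with the same pad to recover the message
decoded = bytes(c ^ k for c, k in zip(result, one_time_pad))
print(decoded)  # b'attack at dawn'
```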

In the first section, result holds the XOR result. In the second part, the result and one_time_pad variables are XORed together to obtain the original plain-text message again.

It is not difficult to realize that the whole strength of the algorithm lies in the \mathcal{K} pad. Of course, as an attacker, if you can obtain \mathcal{K} in some way, then it is not difficult to get the clear-text message from the ciphered one as well.

Continue reading

Getting started with Binary reverse engineering: an example

For a challenge in a university security class, I was given this file to crack: reverse1. I started with reverse0, which is considerably easier. In this post I will briefly explain how I tackled reverse1. I provided the files so you can try on your own and come back for hints if you get stuck! If you are new to this business, as I relatively am, I advise you to start with reverse0 and crack that first.

Hashes of reverse1 file: 
MD5 – c22c985acb7ca0f373b7279138213158
SHA256 – cd56541a75657630a2a0c23724e55f70e7f4f77300faf18e8228cd2cffe8248e

Disassembling and hoping for the best

The first thing I did was to disassemble the file with Radare to have a look at the code.

The assembly is quite jumbled up and difficult to analyse as a whole. A quick look tells us that trying to crack the file just by reversing the whole assembly is no easy task, and actually a silly idea to begin with. There's a cycle after the password is read from standard input, then some other instructions, then another cycle… it's hard to tell what is going on.

Instead, let’s seek the Bad password print section and see what must happen for the code to jump there. If we are lucky, we may find a bunch of final checks that jump to the Bad password section. If we can find those, we can then look at those bits of assembly to understand how to avoid going there.

Scrolling down to the bottom, I can see the Bad password part, starting at 0x080484f0.

Radare helpfully shows two different arrows going into this address. The related comparisons are the following:

Continue reading

Base conversion in Ubuntu (decimal to binary)

Need to convert a base 10 integer into a base 2 one? Or, more generally, convert a number from one base to another? In Ubuntu, the bc utility already provides this feature. It is usually installed by default, so you don't have to do anything special.

Simply run bc, and enter the following commands:
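The commands are embedded in the original post; the relevant one is bc's obase setting. In an interactive session you would type obase=2 at the bc prompt; the same can be reproduced non-interactively with a heredoc:

```shell
bc <<'EOF'
obase=2
123
EOF
```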

Then, all subsequent number inputs will be simply converted to their base-2 representation.

If you want a conversion straight away, without opening an interactive bc session, just enter the following from a terminal:
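The exact command is embedded in the original post; one that matches the description is:

```shell
echo 'obase=2; 123' | bc
```

This prints 1111011, i.e. 123 in base 2. If you also set ibase to read input in another base, set obase first: once ibase changes, every subsequent number (including one assigned to obase) is interpreted in the new input base.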

which will convert the number 123 from base 10 to base 2.

Of course, 2 and 10 can be replaced with any other possible base!