Smart devices and the internet have put billions of pages of data in our pockets.
But we’re overloaded with so much information that it can become unusable.
A Google search for ‘best cat meme’ generates 2.9 million results.
Love? 2.2 billion results.
How can we possibly find what we’re looking for?
MATHS WITH WORDS
Majigsuren Enkhsaikhan believes she might have part of the answer.
The UWA computer science PhD student is looking at ways to extract structure and knowledge from textual data.
To do that, she needs to make a computer mimic the way we humans use our brains to understand text.
Unfortunately, computers can’t read. They’re not designed to be good at primary school comprehension tests.
But computers do excel at maths.
Majigsuren has recently started testing ways of transforming words into vectors so computers can understand them.
This means each word is given a set of numbers that represents its meaning.
For instance, king might be given a high number ranking for maleness and queen a high ranking for femaleness.
Cat might be associated with domestic while lion might be associated with zoo.
Once you’ve turned words into maths, it’s then possible to do simple calculations on the words.
For example, king – man + woman = queen.
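To see how that sum can work, here is a minimal Python sketch. The three-number vectors (royalty, maleness, femaleness) are invented for this illustration, as are the helper names `cosine` and `analogy`. Real word vectors are learned from text and usually have hundreds of dimensions, so treat this as a toy, not as Majigsuren's actual method.

```python
import math

# Toy vectors: (royalty, maleness, femaleness).
# These values are made up for illustration, not learned from any text.
TOY_VECTORS = {
    "king":  (0.9, 0.9, 0.1),
    "queen": (0.9, 0.1, 0.9),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "cat":   (0.0, 0.5, 0.5),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Add the positive words, subtract the negative ones, and return the
    closest remaining word (the input words themselves are excluded)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    used = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in used]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# king - man + woman
result = analogy(["king", "woman"], ["man"], TOY_VECTORS)
print(result)
```

The input words are left out of the candidates because the result of the sum often sits closest to one of the words that went into it.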
“You can do mathematical calculations on word vectors,” Majigsuren says. “That will help to find the new information.”
ARE MEN HAPPIER THAN WOMEN?
Majigsuren says word vectors make it possible to mine text and make sense of the billions of words on the internet.
She decided to test a few controversial sayings in her research.
Majigsuren came across the phrase ‘men are happier than women’ and wanted to prove it wrong.
So she ran a test to check how often the word happy occurred with the words men and women, using a program trained on the 100 billion words of Google News.
Happy occurred alongside the word men 5% more often than alongside the word women.
Not admitting defeat, Majigsuren ran the same test with the word sad.
Sad was associated with women 20% more than with men.
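The article does not spell out how the test was run, so the following is only a sketch of one simple way to count how often two words turn up together, assuming they are counted whenever they share a sentence. The function names and the tiny corpus are invented for illustration, and the numbers they produce say nothing about the real Google News result.

```python
def cooccurrence_count(sentences, word_a, word_b):
    """Count the sentences that contain both words as whole tokens."""
    count = 0
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        if word_a in tokens and word_b in tokens:
            count += 1
    return count

def percent_more(count_a, count_b):
    """How much larger count_a is than count_b, as a percentage of count_b."""
    return (count_a - count_b) / count_b * 100

# Made-up sentences, used only to exercise the functions above.
TOY_CORPUS = [
    "the men were happy with the result",
    "the women were happy with the result",
    "happy men cheered",
    "the women left early",
]

happy_men = cooccurrence_count(TOY_CORPUS, "happy", "men")
happy_women = cooccurrence_count(TOY_CORPUS, "happy", "women")
print(happy_men, happy_women, percent_more(happy_men, happy_women))
```

Matching whole tokens matters here: a plain substring search would count every "women" as a "men" too.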
“So that didn’t work as I hoped,” she laughs.
LOVE IN THE AGE OF THE INTERNET
What if we wanted to ask a more philosophical question?
Majigsuren asked her computer to calculate love – talk + intimacy = ?
It came back with affection, romance and unconditional love.
“And then I took love – intimacy + talk,” Majigsuren says.
“It returned know, nobody cares, despise, hate and yeah, yeah.
“Google News is coming up with phrases like ‘yeah, yeah’, which is in their vocabulary.”
THIS ISN’T THE DATA YOU’RE LOOKING FOR
Majigsuren says more data isn’t always better.
When she searched for words associated with gold on her Google News dataset, the most similar words were silver and precious metals.
But when she ran the test on a geological dataset of WA mineralisation reports, the most similar words to gold were mineralisation and Kalgoorlie.
“If you want domain-specific knowledge, you need to choose the domain-related data,” Majigsuren says.
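A rough Python sketch of why the dataset matters: the same word picks up different neighbours depending on the text it is counted in. Both mini-corpora, the stopword list and the `top_neighbours` helper are invented for this illustration, and plain co-occurrence counts stand in for the word-vector similarity used in the research.

```python
from collections import Counter

# Small, hand-picked list of filler words to ignore (an assumption of this sketch).
STOPWORDS = {"the", "and", "in", "at", "near", "are", "other"}

def top_neighbours(sentences, target, n=2):
    """Return the n words that most often share a sentence with target."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target in tokens:
            # dict.fromkeys removes repeats within a sentence but keeps word order
            for token in dict.fromkeys(tokens):
                if token != target and token not in STOPWORDS:
                    counts[token] += 1
    return [word for word, _ in counts.most_common(n)]

# Made-up stand-ins for a news dataset and a geological dataset.
NEWS = [
    "gold and silver prices rose",
    "silver and gold are precious metals",
    "gold and other metals fell",
]
GEOLOGY = [
    "gold mineralisation in quartz veins",
    "gold mineralisation near kalgoorlie",
    "drilling intersected gold at kalgoorlie",
]

print(top_neighbours(NEWS, "gold"))
print(top_neighbours(GEOLOGY, "gold"))
```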
Ultimately, Majigsuren says the technique could help us find the right information on the internet, answering questions and recommending things users may want based on their search terms or questions.
“For example, if a user searched information about Perth, the system could find the information about Perth, but it also could recommend the must-see places and must-do activities in Perth,” she says.
“That would be useful knowledge.”