Teaching computers to ‘read’ textual data

What happens if you ask a computer to calculate love – talk + intimacy?
Michelle Wheeler
Freelance science journalist

Smart devices and the internet have put billions of pages of data in our pockets.

But we’re overloaded with so much information that it can become unusable.

A Google search for ‘best cat meme’ generates 2.9 million results.

Love? 2.2 billion results.

How can we possibly find what we’re looking for?

MATHS WITH WORDS

Majigsuren Enkhsaikhan believes she might have part of the answer.

The UWA computer science PhD student is looking at ways to extract structure and knowledge from textual data.

To do that, she needs to make a computer mimic the way we humans use our brains to understand text.

Unfortunately, computers can’t read. They’re not designed to be good at primary school comprehension tests.

But where computers do excel is at maths.

Majigsuren has recently started testing ways of transforming words into vectors so computers can work with them.

Each word is given a list of numbers that represents its meaning.

For instance, king might be given a high score for maleness and queen a high score for femaleness.

Cat might be associated with domestic while lion might be associated with zoo.

Once you’ve turned words into maths, it’s then possible to do simple calculations on the words.

For example, king – man + woman = queen.

“You can do mathematical calculations on word vectors,” Majigsuren says. “That will help to find the new information.”
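To see how this kind of word arithmetic can work, here is a minimal Python sketch. The three-number vectors (royalty, maleness, femaleness) and the `analogy` helper are made up for illustration. Real word vectors are learned from text and have hundreds of dimensions, and this is not necessarily how Majigsuren's program does it.

```python
import math

# Hypothetical hand-made vectors: [royalty, maleness, femaleness].
# These values are assumptions chosen for illustration, not learned data.
VECTORS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "cat":   [0.0, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors=VECTORS):
    """Add the positive words, subtract the negative ones, and return
    the closest remaining word in the vocabulary."""
    size = len(next(iter(vectors.values())))
    target = [0.0] * size
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Words used in the question are excluded from the answer.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# king - man + woman
result = analogy(positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy numbers, subtracting man removes the maleness score and adding woman puts a femaleness score in its place, so the nearest remaining word is queen.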

ARE MEN HAPPIER THAN WOMEN?

Majigsuren says word vectors can be used to mine text and make sense of the billions of words on the internet.

She decided to test a few controversial sayings in her research.

Majigsuren came across the phrase ‘men are happier than women’ and wanted to prove it wrong.

So she ran a test to check how often the word happy occurred with the words men and women, using a program trained on the 100 billion words of Google News.

Happy occurred alongside the word men 5% more often than alongside the word women.

Not admitting defeat, Majigsuren ran the same test with the word sad.

Sad was associated with women 20% more often than with men.

“So that didn’t work as I hoped,” she laughs.
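The article does not say exactly how the association was measured. One simple way to compare how often two words appear together is to count the sentences that contain both. The sketch below does that on a handful of made-up sentences. The corpus and the `cooccurrence_count` and `relative_difference` helpers are assumptions for illustration, not the Google News data or Majigsuren's program.

```python
# Made-up sentences for illustration only.
CORPUS = [
    "the men were happy with the result",
    "the women were happy about the news",
    "happy men cheered",
    "the women felt sad",
    "sad women and sad men left early",
]


def cooccurrence_count(target, context, sentences):
    """Count the sentences that contain both words.

    Sentences are split into whole words, so "women" is not
    mistaken for "men".
    """
    count = 0
    for sentence in sentences:
        tokens = set(sentence.split())
        if target in tokens and context in tokens:
            count += 1
    return count


def relative_difference(target, first, second, sentences):
    """How much more often target appears with first than with second,
    as a fraction of the count for second."""
    a = cooccurrence_count(target, first, sentences)
    b = cooccurrence_count(target, second, sentences)
    if b == 0:
        raise ValueError("second word never co-occurs with target")
    return (a - b) / b


happy_men = cooccurrence_count("happy", "men", CORPUS)
happy_women = cooccurrence_count("happy", "women", CORPUS)
print(happy_men, happy_women)
print(relative_difference("sad", "women", "men", CORPUS))
```

On a corpus this small the percentages mean nothing. The point is only the shape of the comparison: two counts and the relative gap between them.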

LOVE IN THE AGE OF THE INTERNET

What if we wanted to ask a more philosophical question?

Majigsuren asked her computer to calculate love – talk + intimacy = ?

It came back with affection, romance and unconditional love.

“And then I took love – intimacy + talk,” Majigsuren says.

“It returned know, nobody cares, despise, hate and yeah, yeah.

“Google News is coming up with phrases like ‘yeah, yeah’, which is in their vocabulary.”

UWA PhD student Majigsuren Enkhsaikhan

THIS ISN’T THE DATA YOU’RE LOOKING FOR

Majigsuren says more data isn’t always better.

When she searched for words associated with gold on her Google News dataset, the most similar words were silver and precious metals.

But when she ran the test on a geological dataset of WA mineralisation reports, the most similar words to gold were mineralisation and Kalgoorlie.

“If you want domain-specific knowledge, you need to choose the domain-related data,” Majigsuren says.
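A toy Python sketch can show why the choice of dataset matters. It ranks the words that most often share a sentence with gold in two tiny made-up corpora. The sentences, the stopword list and the `top_associates` helper are assumptions for illustration, and counting shared sentences is a much cruder measure than the word-vector similarity described above.

```python
from collections import Counter

# Small filler words to ignore; a hypothetical list for this example.
STOPWORDS = {"and", "are", "the", "near"}

# Made-up stand-ins for a general news corpus and a geological one.
GENERAL_NEWS = [
    "gold and silver are precious metals",
    "silver and gold are precious",
    "investors buy gold and silver",
]
GEOLOGY_REPORTS = [
    "gold mineralisation occurs near kalgoorlie",
    "drilling intersected gold mineralisation",
    "kalgoorlie gold mineralisation reports",
]


def top_associates(word, sentences, k=2):
    """Return the k words that most often share a sentence with word."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(
                t for t in tokens if t != word and t not in STOPWORDS
            )
    return [w for w, _ in counts.most_common(k)]


general = top_associates("gold", GENERAL_NEWS)
geology = top_associates("gold", GEOLOGY_REPORTS)
print(general)
print(geology)
```

The same question gives different answers depending on the text it is asked of, which is the point Majigsuren makes about domain-related data.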

Ultimately, Majigsuren says the technique could help us find the right information on the internet, answering questions and recommending things users may want based on their search terms or questions.

“For example, if a user searched information about Perth, the system could find the information about Perth, but it also could recommend the must-see places and must-do activities in Perth,” she says.

“That would be useful knowledge.”

About the author
Michelle Wheeler
Michelle is a former science and environment reporter for The West Australian. Her work has seen her visit a snake-infested island dubbed the most dangerous in the world, test great white shark detectors in a tinny and meet isolated tribes in the Malaysian jungle. Michelle was a finalist for the Best Freelance Journalist at the 2020 WA Media Awards.

Republishing our content

We want our stories to be shared and seen by as many people as possible.

Therefore, unless it says otherwise, copyright on the stories on Particle belongs to Scitech and they are published under a Creative Commons Attribution-NoDerivatives 4.0 International License.

This allows you to republish our articles online or in print for free. You just need to credit us and link to us, and you can’t edit our material or sell it separately.
