Fitting text into categories

How to classify?

You can use various methods to classify objects. There are many tools in machine learning and natural language processing packages, but sometimes it's hard to find one that suits your problem well. I wanted to write a category classifier in Scala and keep it as simple as possible. I could have used Breeze, but at first glance it's quite overwhelming and its documentation isn't very user-friendly. (I love Python's scikit-learn, which is in my opinion the best machine learning library in the world.)

I decided to implement it all myself, using only the Scala standard library.

Main idea behind my script

Written language contains lots of common words such as "the", "and", "this", etc. A specialised article or book also contains lots of vocabulary connected to its field.

I did some web crawling to collect articles with typical vocabulary for three basic fields: physics, electronics and mathematics. As a sample of normally used vocabulary (not connected with any specialised field) I used some random articles from the Guardian.

After collecting the articles I could compute the most common words for each category. My classifying algorithm basically chooses the category whose vocabulary has the biggest intersection with the given text. If that intersection is very small, it returns the category "boring".
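The idea above can be tried on toy data before looking at the full script. This is only a minimal sketch: the object name `OverlapSketch`, the two tiny vocabularies and the helper names are made up for illustration, and it skips punctuation handling. The 10% threshold is the one used in the script below.

```scala
object OverlapSketch {
  // Hypothetical toy vocabularies, far smaller than the 400-word sets built in the post.
  val toyVocabularies: Map[String, Set[String]] = Map(
    "physics" -> Set("energy", "entropy", "particle", "heat"),
    "math"    -> Set("polynomial", "algebra", "variable", "equation")
  )

  // Percentage of the text's distinct words that also appear in the vocabulary.
  def overlapPercent(words: Set[String], vocabulary: Set[String]): Double =
    if (words.isEmpty) 0.0
    else (words intersect vocabulary).size.toDouble / words.size * 100

  // Pick the category with the largest overlap, or "boring" if it is not above the threshold.
  def classify(text: String, threshold: Double = 10.0): String = {
    val words = text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet
    val (best, score) = toyVocabularies
      .map { case (name, vocabulary) => name -> overlapPercent(words, vocabulary) }
      .maxBy(_._2)
    if (score > threshold) best else "boring"
  }
}
```

For example, `OverlapSketch.classify("heat is energy of particle motion")` finds three of the six distinct words in the physics set (50%), so it returns "physics", while a sentence with no overlap falls back to "boring".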

Implementation

{% codeblock splitContentToSentences - commonVocabulary.scala %}
import java.io.RandomAccessFile
import scala.collection.mutable.HashMap

// Strip common punctuation characters from the text.
def removePunctuation(text: String) = {
  val punct = ",.?;:!\""
  text.toList.filterNot(char => punct contains char).mkString("")
}

// Lower-case the text and split it into words on whitespace.
def getWordsFromRawText(text: String) = {
  removePunctuation(text).toLowerCase.split("\\s+").toList
}

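// Read every file in basePath and return all of their words as one list.
def getAllWords(basePath: String) = {
  import java.io.File
  val baseDir = new File(basePath)
  val paths = baseDir.listFiles.toList.sorted
  var words = List[String]()
  for (path <- paths) {
    val raf = new RandomAccessFile(path, "r")
    // Buffer sized to the whole file so readFully loads all of it.
    val buff = new Array[Byte](raf.length.toInt)
    raf.readFully(buff)
    raf.close()
    words = words ++: getWordsFromRawText(new String(buff))
  }
  words
}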

// Count how often each word occurs, skipping words from normalVocabulary,
// and return (word, count) pairs sorted from most to least frequent.
def rankWords(words: List[String], normalVocabulary: Set[String]) = {
  val wordOccurences = new HashMap[String, Int]()
  for (word <- words if !(normalVocabulary contains word)) {
    if (wordOccurences contains word) {
      wordOccurences(word) += 1
    } else wordOccurences(word) = 1
  }
  wordOccurences.toList.sortBy(_._2).reverse
}

val basePath = "articles/physics/"
val normalVocabulary = getAllWords("articles/random/")

val commonNormalVoc = (for {word <- rankWords(normalVocabulary, Set("")).slice(0, 400)} yield word._1).toSet

val physicsVocabulary = getAllWords("articles/physics/")
val commonPhysicsVoc = (for {word <- rankWords(physicsVocabulary, commonNormalVoc).slice(0, 400)} yield word._1).toSet

val electronicsVocabulary = getAllWords("articles/electronics/")
val commonElectronicsVoc = (for {word <- rankWords(electronicsVocabulary, commonNormalVoc).slice(0, 400)} yield word._1).toSet

val mathVocabulary = getAllWords("articles/math/")
val commonMathVoc = (for {word <- rankWords(mathVocabulary, commonNormalVoc).slice(0, 400)} yield word._1).toSet

val testArticleAboutPhysics = """ Thermodynamics is a branch of natural science concerned (...) or statistical mechanics, gave explanations of macroscopic thermodynamics by statistical predictions of the collective motion of particles based on the mechanics of their microscopic behavior. """

val testArticleAboutMath = """ Algebra can essentially be considered as doing computations (...) a polynomial in a single variable. """

val testArticleAboutElectronics = """ Wheatstone bridge (...) zero the voltage. """

val typicalGuardianArticle = """ The French president, (...) have legalised same-sex marriage. """

// Score the text against each category vocabulary and print the best match.
def matchCategory(text: String) = {
  val words = getWordsFromRawText(text).toSet
  val wordsCount = words.size
  val categories = HashMap(
    "physics" -> (words intersect commonPhysicsVoc).size.toDouble / wordsCount * 100,
    "electronics" -> (words intersect commonElectronicsVoc).size.toDouble / wordsCount * 100,
    "math" -> (words intersect commonMathVoc).size.toDouble / wordsCount * 100
  ).toList.sortBy(_._2).reverse

  // Fall back to "boring" when even the best category overlaps by 10% or less.
  val winningCategory =
    if (categories(0)._2 > 10) categories(0)
    else ("boring", (words intersect commonNormalVoc).size.toDouble / wordsCount * 100)

  println("Article matches category: " + winningCategory)
  println(categories)
  println("Similarity to normal vocabulary: " +
    (words intersect commonNormalVoc).size.toDouble / wordsCount * 100)
}

println("Testing article about physics:")
matchCategory(testArticleAboutPhysics)
println("\nTesting article about math:")
matchCategory(testArticleAboutMath)
println("\nTesting article about electronics:")
matchCategory(testArticleAboutElectronics)
println("\nTesting typical guardian article:")
matchCategory(typicalGuardianArticle)
{% endcodeblock %}

I stored the collected articles in the directory "articles/{category}" and used them to generate the vocabulary sets.
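Reading such a directory can also be done without sizing a buffer by hand. The sketch below is an alternative, not the code used in this post: `VocabularyFiles` and `readArticles` are names I made up, and it assumes the articles are plain UTF-8 text files sitting directly in the directory.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object VocabularyFiles {
  // Read every regular file directly under dirPath and return its text, in file-name order.
  // A missing directory yields an empty list (listFiles returns null in that case).
  def readArticles(dirPath: String): List[String] = {
    val files = Option(new File(dirPath).listFiles).map(_.toList).getOrElse(Nil)
    files
      .filter(_.isFile)
      .sortBy(_.getName)
      // Assumption: articles are UTF-8 encoded.
      .map(file => new String(Files.readAllBytes(file.toPath), StandardCharsets.UTF_8))
  }
}
```

With the layout described above, something like `VocabularyFiles.readArticles("articles/physics/")` would return one string per article, ready to be split into words.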

There are two important functions in this code: "rankWords" and "matchCategory". Ranking is based on how many times a given word occurs in the text. Matching uses set intersections to estimate which category suits the text best.
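The same ranking can probably be written without a mutable map. This is just a sketch of that variant: `RankSketch` and `rankWordsImmutable` are hypothetical names, and unlike "rankWords" it breaks ties alphabetically so the order is deterministic.

```scala
object RankSketch {
  // Count occurrences of each word that is not in `ignored`,
  // then sort from most to least frequent (ties broken alphabetically).
  def rankWordsImmutable(words: List[String], ignored: Set[String]): List[(String, Int)] =
    words
      .filterNot(ignored.contains)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
      .toList
      .sortBy { case (word, count) => (-count, word) }
}
```

For instance, ranking `List("field", "the", "field", "wave", "the", "field")` while ignoring `Set("the")` gives `List(("field", 3), ("wave", 1))`.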

Here is the output of this script (the test articles were much longer than the excerpts shown in the code):

{% codeblock commonVocabulary.scala output%}
Testing article about physics:
Article matches category: (physics,34.66666666666667)
List((physics,34.66666666666667), (electronics,10.222222222222223), (math,8.0))
Similarity to normal vocabulary: 23.11111111111111

Testing article about math:
Article matches category: (math,26.01880877742947)
List((math,26.01880877742947), (physics,21.003134796238246), (electronics,13.793103448275861))
Similarity to normal vocabulary: 23.824451410658305

Testing article about electronics:
Article matches category: (electronics,23.717948717948715)
List((electronics,23.717948717948715), (physics,16.666666666666664), (math,14.102564102564102))
Similarity to normal vocabulary: 27.564102564102566

Testing typical guardian article:
Article matches category: (boring,30.45977011494253)
List((electronics,2.8735632183908044), (math,2.2988505747126435), (physics,2.2988505747126435))
Similarity to normal vocabulary: 30.45977011494253
{% endcodeblock %}

It took some code, but it's actually pretty simple and works well. Scala offers a really nice API for this kind of problem.
