OMG. OMG. OMFG.

GenderAnalyzer: Girls Who’re Boys Who Like Boys To Be Girls

I had a thing with a guy on the internet when I was in college. We exchanged e-mails for a while; I knew he was into me. I knew it even though he constantly worried that I was a bored 40-something Midwestern dude toying with his emoticons.

He never gave me any reason why he thought I was a man.

Neither did the GenderAnalyzer when I plugged in my blog to analyze my gender.

Curious to see how they’d arrived at this conclusion, I contacted the team at GenderAnalyzer and asked. Jon Kågström, the brains behind the operation, wrote me back to fill me in on the details of this wondrous machine:

It all started back in 2004 when I was doing my master thesis on machine learning for spam filtering. I was fascinated by how well they worked on spam (often better than humans) and started to wonder what more than spam text classifiers could be used for. I did a big number of hobby research projects testing classifiers on different domains (e.g., sentiment, happy/sad, web page categorization).

Doing this while improving the classifier technology derived from my master thesis, I came up with the idea to let everyone have access to classifiers. So with help from two friends, Roger Karlsson and Emil Kågström, we built uclassify.com. The idea is to share advanced classifier technology that is easy to use (don’t even have to be a programmer) for free in a Web 2.0 format.

After we had finished with uclassify.com we decided to test if it’s possible to have a computer differentiate between males and females by looking at their text. We found the idea really interesting so we collected 2,000 blogs written by males and females and used the uClassify API to train a classifier. We decided to put it into test and created GenderAnalyzer.

The accuracy is lower than we expected and we believe a major reason for that is that the training data is biased (only collected from blogspot). We think we can get better accuracy by using the URLs that users test as training data (when they vote if it worked or not, we train it accordingly). In this way we would have a classifier that adapts to real world gender data. Just as machine learning spam filters do.
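For the curious, here is roughly what "train a classifier on labeled blogs, then keep training it on reader votes" can look like. This is a minimal sketch, not uClassify's code: the e-mail does not say which algorithm or API calls sit behind GenderAnalyzer, so the naive Bayes model, the `NaiveBayesText` class, and the `learn_from_vote` helper are all assumptions made for illustration, and the training texts are made-up tokens rather than real blog posts.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesText:
    """Tiny multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        """Add one labeled document; can be called at any time."""
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        """Return the label with the highest log-probability (None if untrained)."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                # Add-one smoothing so an unseen word does not zero out a label.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def learn_from_vote(clf, text, guess, vote_correct):
    """Fold a reader's vote back into the model (assumes exactly two labels)."""
    if vote_correct:
        true_label = guess
    else:
        true_label = "female" if guess == "male" else "male"
    clf.train(text, true_label)
    return true_label


clf = NaiveBayesText()
clf.train("alpha beta gamma", "male")      # placeholder tokens, not real posts
clf.train("delta epsilon zeta", "female")
guess = clf.predict("alpha gamma")
learn_from_vote(clf, "alpha gamma", guess, vote_correct=True)
```

The point of the design is that `train` only adds counts, so a vote on a tested URL is just one more training document and nothing has to be rebuilt from scratch.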

Today on the uClassify blog, Kågström elaborated on the Analyzer’s current low accuracy (53%):

Our training data of 2,000 blogs is automatically collected from blogspot. Running internal tests (10 fold cross validation) on this data gives us an accuracy of 75%. This effectively means “Given that the corpus is a perfect representation of real world data, the classifier is able to give any real world data the correct label by a chance of 75%”. So our training data is probably not very representative, as a matter of fact it’s very stereotypical.
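In case "10 fold cross validation" is unfamiliar: the labeled data is split into ten slices, and each slice takes a turn as the test set while the other nine are used for training. The sketch below is an assumption-laden illustration, not the uClassify test harness; `k_fold_accuracy`, the stand-in "always guess the majority label" classifier, and the toy samples are all hypothetical.

```python
import random
from collections import Counter


def k_fold_accuracy(samples, train_fn, predict_fn, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    samples    -- list of (text, label) pairs
    train_fn   -- takes a list of (text, label) pairs, returns a model
    predict_fn -- takes (model, text), returns a predicted label
    """
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal slices

    correct = 0
    for i, held_out in enumerate(folds):
        # Train on the other k-1 folds, test on the one held out.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(training)
        correct += sum(predict_fn(model, text) == label
                       for text, label in held_out)
    return correct / len(samples)


# Hypothetical stand-in classifier: always guess the most common training label.
def train_majority(pairs):
    return Counter(label for _, label in pairs).most_common(1)[0][0]


def predict_majority(model, text):
    return model


# Toy, deliberately skewed corpus: 32 "male" and 8 "female" samples.
toy = [("post %d" % i, "female" if i % 5 == 0 else "male") for i in range(40)]
score = k_fold_accuracy(toy, train_majority, predict_majority, k=10)
```

On this skewed toy corpus the stand-in scores 0.8 without reading a single word, which is the same caveat Kågström makes: a cross-validation number only tells you how well the classifier fits the corpus, not how well the corpus fits the real world.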

When someone is testing a blog we are not crawling through posts on the blog to get a good amount of text. We are only hitting the given URL and using the text (and html) that appear there as test data. So a page with mostly images or frames will give bad test data.

We are trying to encode test data to utf-8 which is the format of the training data – it could be that we are missing some encodings.
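The last two points (one URL's worth of HTML as the only input, and getting everything into UTF-8) can be sketched with the Python standard library. This is a hedged illustration rather than GenderAnalyzer's pipeline: the `TextExtractor` class, the `decode_page` and `page_text` helpers, and the order of fallback encodings are all assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def decode_page(raw, declared=None):
    """Decode raw bytes, trying the declared charset, UTF-8, then Windows-1252."""
    for encoding in (declared, "utf-8", "cp1252"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    return raw.decode("latin-1")  # maps every byte, so it cannot fail


def page_text(raw, declared=None):
    """Return the visible text of one fetched page as a str."""
    parser = TextExtractor()
    parser.feed(decode_page(raw, declared))
    parser.close()
    return " ".join(parser.chunks)


# A page that is all images yields no text at all for the classifier.
image_only = b"<html><body><img src='a.png'><img src='b.png'></body></html>"
empty = page_text(image_only)

# A Latin-1 encoded page still comes out as proper text, ready for UTF-8.
legacy = "<p>Kågström</p><script>var x = 1;</script>".encode("latin-1")
utf8_bytes = page_text(legacy).encode("utf-8")
```

Guessing encodings this way is a heuristic: a byte sequence can decode without error under the wrong charset and still come out garbled, which is one way a tool like this could end up "missing some encodings."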

It’s a worthwhile experiment, and the technology to make gender analysis possible is there. Whatever issues GenderAnalyzer has now, a broader set of tested blog samples should raise its accuracy: as more people use it and the machine gets more training, it will become better equipped to answer the question “man or woman: who is writing that blog?”

Kågström does raise a good question in the uClassify blog post, though: maybe “the difference between male and female writing is not significant?”

Only one way to find out. Best of luck to the team at GenderAnalyzer.

As for the college e-fling, we still talk from time to time. We’re both married now, to other people. I think he finally believes that I’m a woman. It only took him five years.



15 Responses to “GenderAnalyzer: Girls Who’re Boys Who Like Boys To Be Girls”

  1. Semper



    Perhaps the low accuracy could be, somehow, related to gender fluidity? Honestly, the project seems kind of ridiculous to me, for this reason, and for the one that you cited- is it really significant?


  2. gracie



    when i tested my blog a few weeks ago… i was a man. i thought it was the word usage because i used “cunt” all the time.

    go figure.


  3. Coco_beans



    I got 98% man. 98%

    I guess the other 2% is the (small) space left by my missing penis.


  4. the girl Riot



    re: Semper–i think you hit on an interesting point, as well.

    i know, AV, that you’re near as huge a Winterson fan as i am; when i read your post, some of her experimental word usage in Written on the Body sprang to mind. can sex (gender? as a better word here as words are representational?) be determined by arrangement of words? as much as we are ’sexless, genderless narrators,’ are we given away by our verbs?

    OMG, the girl Riots last blog post: what runs at a speed of 104?


    AV Flox Reply:

    I love that you picked up on this! Winterson crossed my mind her share of times as I was developing this piece.


  5. brooks bayne



    “We have strong indicators that http://brooksbayne.com is written by a man (94%).”

    Coco’s writing is more manly than mine lol. My last post was about bacon.

    OMG, brooks baynes last blog post: Mo’s Bacon Bar by Vosges


  6. unreliable narrator



    Yep, it thinks I have a willy too (but only a wee one, at 57% XY). Is it cos I swear, or use subordinate clauses?


  7. Beck



    I got a 71% female:)

    OMG, Becks last blog post: Same Girl’s Paradise


  8. Jon Kågström / Genderanalyzer



    Hey everyone! Good post Anaiis! Just wanted to let you know we have new training data now. Perhaps it will work better now? =)

    OMG, Jon Kågström / Genderanalyzers last blog post: GenderAnalyzer thoughts


    AV Flox Reply:

    Thanks for the update, Jon. It’s a worthwhile experiment and I’m not only speaking for myself when I say we look forward to seeing GenderAnalyzer improve!

    Do keep me posted, I would love to update this piece as the Analyzer evolves.


  9. joe



    I find that when I am on the internet I prefer all things “men” and I am a man.


  10. FelipeAzucares



    Wrt Pageviews vs. respect. Hmmm apparently nuthin’ but it seems … you can cut a lot of slack with a nice yack. Me, I look like a pig but write like a dove.


  11. Atherton Bartelby



    I was literally astounded to see that GenderAnalyzer was 67% certain that it was written by a man. One of my most frequent comments back in my old LJ days from new(er) readers was that they were certain it was written by a woman (a shockingly foul-mouthed woman, but a woman, nevertheless). I guess I don’t write about “Sex And The City” so alarmingly frequently anymore to skew the findings.

    Anyway, though, I loved this piece, because so often we stumble across internet memes / quizzes / machines and think very little of the amount of work, research, and knowledge that goes into their building. Fortunately we have people like you, however, to unearth the truly interesting story of Kågström’s creation of GenderAnalyzer. Brava!

    OMG, Atherton Bartelbys last blog post: Eyes Wide Open


  12. Porter



    Seventy-five percent womanly, my blog. (I am a man.)


  13. Palo Alto Real Estate



    I find this incredibly fascinating. I wonder over time, if they can get this to become more accurate or will the style we talk just blend more and more, causing the accuracy to drop?



  • AV Flox writes about web culture; new media’s gradual overthrow of old media; trends in social media; and the complicated entanglements people get themselves into as we venture forth into this new world where, more and more, the analog is colliding with the digital.
