Now this is cool. Tagyu is an “auto-tagging” service of sorts, created by Adam Kalsey. You paste in some text (or submit via their REST API) and it suggests tags, using some kind of a similarity metric between your text and already tagged texts in Tagyu’s index (gathered from del.icio.us etc.).
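Tagyu's actual method isn't public as far as I know, so here is only a minimal sketch of the general idea: compare the new text with already tagged texts and let the most similar ones vote for their tags. The bag-of-words cosine similarity, the `suggest_tags` name and the tiny in-memory index are all my assumptions, not anything Tagyu documents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_tags(text, tagged_index, top_k=3):
    # tagged_index: list of (text, [tags]) pairs. This is a hypothetical
    # stand-in for an index of already tagged texts.
    query = Counter(tokenize(text))
    scores = Counter()
    for doc_text, tags in tagged_index:
        sim = cosine(query, Counter(tokenize(doc_text)))
        for tag in tags:
            # Each tagged text votes for its tags, weighted by similarity.
            scores[tag] += sim
    return [tag for tag, score in scores.most_common(top_k) if score > 0]


# Made-up example index, for illustration only.
index = [
    ("social bookmarking and tagging tools", ["tagging", "tools"]),
    ("recipes for apple pie", ["cooking"]),
]
suggestions = suggest_tags("a new tagging service", index)
```

With this toy index the suggestions come from the first entry only, because the second shares no words with the query. A real service would need term weighting, stop-word handling and a far larger index.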
So far, I’ve tried a few different texts, and about half the time the returned tags are great. This is impressive, because this is not an easy problem to solve, but 50% precision is not quite enough for prime time. If someone (sploggers?) unleashes Tagyu to auto-tag a large volume of posts that feed back into the del.icio.us and Tagyu system, that would be detrimental to improving precision of the system, unless you could assign some kind of a score to the quality of tags (yes, that’s a chicken/egg thing).
This is the first service of this kind that I’m aware of, and there are lots of applications of this kind of thing in blog search. There could be an ad-matching app in there, too. And, an intermediate step in Tagyu is matching content to other content (and then to tags). I hope Adam Kalsey keeps up the R&D effort on this. Tagyu has a super-clean looking site. Very nice.
btw, for this post’s text, Tagyu returned the following tags: tagging del.icio.us tools. Looks good to me.
I downloaded a 30-day trial of Filemaker for OS X about 2-3 weeks ago. Had some ideas for a notetaking system with tagging, dynamic cross-linking, flexible querying, stuff like that. I haven’t had time to even unzip the trial, but I’ve already received two phone calls and two follow-up emails from Filemaker sales reps. Don’t they have anything better to do? Isn’t Filemaker selling without this kind of pestering? What’s wrong with these people? If I have trouble with it, I know where to go. If I want to buy a license, I know where to go for that, too.
At this rate, it’s unlikely that I’ll even take the time to install the trial. I’ll keep using MacJournal and see if I can uncover some features that get me closer to what I’m imagining my note-taking application to be. MacJournal is a nice piece of software, and I haven’t been pestered by them once.
If you’re a word nerd or, heck, if you just speak or are learning English, you’ll enjoy The Word Nerds, a weekly podcast by two brothers on topics like The Cold War and Hostile Language, Collective Nouns, The Unnamed Antecedent, with segments like The Rude Word of the Week. It’s obvious these guys spend hours preparing for their podcasts. They don’t just sit down and start talking (I won’t name names. Well, maybe in a future post I will). Anyway, The Word Nerds gets the coveted remylabs podcast seal of approval, which means I’ll stay subscribed in NetNewsWire (does OPML, unlike iTunes) as long as they keep ‘casting.
Article in NY Times today, Yahoo is wooing I.B.M. Technical Talent:
Yahoo plans to announce Thursday that it is recruiting scientists who pioneered an advanced search-engine technology at I.B.M.’s Silicon Valley research laboratory.
Prabhakar Raghavan, a computer scientist who once led the Clever effort, joined Yahoo last week as head of research. He left I.B.M. in 2000 to become a vice president and chief scientist at Verity Inc., a maker of search and retrieval software for corporations; he was later named chief technical officer.
Yahoo offers one of the best opportunities to explore new ideas in search, Mr. Raghavan said.
One area that will be pursued is new search technologies related to digital media.
It’s been fun to watch Google being forced from the position of category killer to more-or-less evenly matched contestant over the last year or two. There’s a mind-boggling amount of innovation happening in search, which is levelling the playing field for new entrants, but even the stuff we’re seeing now is only the beginning. Search, and other modes of information retrieval, will become even more ubiquitous and integrated than they are now, and we’ll wonder how an OS like Windows without integrated search ever came to dominate a market. The desktop market itself may go away (yes, I’ve been reading Paul Graham’s book Hackers and Painters, which contains this great essay on server-based software from 2001, which is still relevant and engaging, as are his many other essays).
Search is poised to become the great collective memory, and new research being brought to market in real services, along with the availability of public APIs, will speed progress toward that reality. But it won’t be just the extent of information covered by search that will grow, but also the interconnectivity of search services and, most importantly, new modes of retrieving information (the only mode now in widespread use is keyword search, which is as old as computer science itself, or much older if you count manual versions such as file cabinets, card catalogs and other manually compiled indexes). I don’t see any reason why search shouldn’t aim to duplicate in software all of the modes in which humans retrieve information in their own brains (by context, by association and so on) or from others, by interactive question answering or guided discovery.
Steve Rubel and Niall Kennedy are reporting on a Yahoo RSS search service which was briefly public this morning. Seems to combine feed search (not just blogs, apparently, but other feed content too, as Feedster does) and several ranking options (date, relevance, and popularity). I’m curious about the popularity ranking, but I’d guess the initial version will resemble a Technorati-like tally of incoming links.
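To make that guess concrete, here is a minimal sketch of a popularity ranking built from a plain tally of incoming links. It is purely an illustration of the tally idea, not how Yahoo or Technorati actually compute anything. The `rank_by_inbound_links` name and the sample link pairs are made up, and counting each linking source only once is my own assumption.

```python
from collections import defaultdict


def rank_by_inbound_links(links):
    # links: iterable of (source, target) pairs, e.g. blog URLs.
    # Each distinct source counts once per target; self-links are ignored.
    sources_by_target = defaultdict(set)
    for source, target in links:
        if source != target:
            sources_by_target[target].add(source)
    tallies = [(target, len(srcs)) for target, srcs in sources_by_target.items()]
    # Most linked-to first; ties broken alphabetically for a stable order.
    return sorted(tallies, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical link data, for illustration only.
sample_links = [
    ("a", "c"),
    ("b", "c"),
    ("a", "b"),
    ("c", "c"),  # self-link, ignored
    ("a", "c"),  # duplicate, counted once
]
ranking = rank_by_inbound_links(sample_links)
```

Counting distinct sources rather than raw links is one simple way to keep a single prolific linker from dominating the tally.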
Greg Linden wonders whether the small blog/feed search engines will survive the entry of the giants into the field:
… it is good for a startup to see the entry of a big company into its area since it attracts attention and legitimizes the field … but competing directly against these giants is scary if you have no differentiator.
While the small players have driven innovation and broad acceptance of concepts like link popularity and tagging, they continue to struggle with scalability. Also, the most compelling products to come out of the blog search startups, while they’ve been exciting and even revolutionary from a user’s point of view, have not been technologically deep in the sense of difficult to duplicate by the search giants. There have been exceptions, of course, but no really deep technology is in evidence among those services that have made the biggest splashes (technorati, bloglines, flickr, del.icio.us).
So, when a search giant comes in with equal-or-better features, scalability, and a huge engineering team that can relatively quickly merge ideas emerging from the programming part of the blogosphere into the vast search toolkit that the giants already have, that might just cast a bit of a cloud over the little guys.
Having said that, I believe there will continue to be a place for the little guys in the blog search ecosystem. They’re the real innovators and they have their ears to the ground. And even at the break-neck speed at which Yahoo and Google have been rolling out features lately, an army of little guys can still cover a lot more ground than the two giants in the search for the next cool thing that will make users’ lives (even) better.
Susan Mernit asks some interesting questions about tagging’s scalability in this post:
1. How well will tagging work as an organizing and information retrieval method when there are millions of tags?–That’s where having additional filters, such as identity, trust or cohort group becomes relevant–becomes needed.
2. How can developers move tagging into a wider market? I describe tagging to non-geek friends and they are interested, but these folks aren’t blogging, don’t use tag-friendly photo services and are a world away, still–how can the tools bring them closer?
I don’t think that tagging will turn out to be the emperor’s new clothes (which isn’t at all what Susan is suggesting, either). But there’s a sense here that the honeymoon is over and it’s time for tagging to get serious about earning its keep for readers and searchers and to make stuff not just more broadcastable in flickr and Technorati, but also to make the good stuff more findable.
Interesting informal experiment in the Yahoo! Buzz Log:
Do you find blogs via links or through search? We wanted to know, so we lined up the top 20 blogs in the blogosphere according to Technorati … we did peek at the number of searches each received over the last week.
Turns out the lead blog on Technorati runs in the middle of the search pack. Fark ranked #5 on Technorati, but in terms of searches — it’s the top dog, er, blog.
Note that Yahoo! restricted its sample set to the top 20 blogs according to Technorati, so if all blogs were lined up according to search popularity on Yahoo, the top 20 might or might not include any of the Technorati top 20.
So, what’s this mean? It means that searches map to a different popularity ranking than links do. They’re two different measures. Technorati takes a sort of intra-blogosphere measurement of popularity, while Yahoo! is taking an external measurement of blog popularity, looking into the blogosphere from the web at large (or at least from search.yahoo.com). If you’re a marketer, you probably like Yahoo’s measurement better.
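A small sketch of what "two different measures" looks like in practice: rank the same set of blogs once by inbound links and once by search volume, then put the two positions side by side. The blog names and counts below are placeholders I made up, not figures from the Buzz Log post or from Technorati.

```python
def rank_positions(scores):
    # Map each name to its 1-based rank, highest score first.
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def compare_rankings(link_counts, search_counts):
    # Returns (name, link_rank, search_rank) for blogs present in both
    # measures, listed in link-rank order.
    link_rank = rank_positions(link_counts)
    search_rank = rank_positions(search_counts)
    shared = sorted(set(link_rank) & set(search_rank), key=lambda n: link_rank[n])
    return [(name, link_rank[name], search_rank[name]) for name in shared]


# Placeholder numbers, for illustration only.
link_counts = {"blog_a": 900, "blog_b": 700, "blog_c": 500}
search_counts = {"blog_a": 40, "blog_b": 10, "blog_c": 95}
comparison = compare_rankings(link_counts, search_counts)
```

In this made-up data the blog with the fewest links gets the most searches, which is the same kind of mismatch the Buzz Log describes for Fark.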
(via Steve Rubel)
Greg Gershman has built a cool application of Google Earth. You can jump from Blogdigger Local search results to Google Earth and see markers for all of the blogs in your geo neighborhood. The result looks something like this:
(That’s Greg’s image. Don’t have Google Earth running here, waiting for the OS X version. Impatiently.) Blogdigger seems to have found its niche with Blogdigger Local, and it’s a good one.
Interesting article in Sunday’s NYT about a Lawrence, Kansas paper including user-generated content in a big way:
“I don’t think of us as being in the newspaper business,” said Mr. Simons, the editor and publisher of The Journal-World and the chairman of the World Company, the newspaper’s parent. “Information is our business and we’re trying to provide information, in one form or another, however the consumer wants it and wherever the consumer wants it, in the most complete and useful way possible.”
“We believe that journalism has been a monologue for so long and now is the perfect time for it to become a dialogue with our readers,” said Rob Curley, 34, the World Company’s director of new media. “We want readers to think of this as their paper, not our paper.”
Danny Sullivan is skeptical about the accuracy of Google’s and Yahoo’s results counts, used by Tristan Louis in two studies, which concluded that Yahoo has better coverage of blogs than Google, which in turn has better coverage than Technorati. Danny posted an email conversation with Tristan about his study. It’s a little hard to follow the lines of argument, but it’s well worth reading because it illuminates the difficulties in getting a handle on index size, and especially blog coverage, by the search giants.
Danny, from his exchange with Tristan:
Also, Google did say “of about” with the numbers it reports. That’s not an accident. They’re saying that this is an estimate. But no disagreement with me. If you put up a count, it would be nice if the count was as accurate as possible. Google’s have come under question.
Hmm. From what I’ve seen in Tristan’s data and my own testing, it’s Yahoo’s counts that ought to come under question, specifically for link: queries.
Danny to Tristan again:
The link: command is completely different than the site: command. The link command tells you nothing about the size of the index. As for a confirmation that all links aren’t reported, this past blog post from SEW gives you confirmation and this page on Google mentions links are only a sampling of what Google knows although this other Google page fails to make this clear.
link: and site: are very different, that’s true enough. And maybe the link command doesn’t tell you much about the size of an index, but if link collection methods are similar between Yahoo and Google (and why wouldn’t they be, it’s a relatively easy part of the whole game), then the counts ought to be similar. But they’re not, not by a long shot.
By the way, a big thanks to Tristan for posting his studies and kicking off this discussion. Most of us don’t take the time to do analysis of that depth to support our opinions, and to post the entire method and dataset so others can reproduce it, shoot holes in it, go off on tangents from it.
(I stumbled onto Danny’s post via John Battelle)