Archive for the 'Web Development' Category

404 error checker and site crawler

Friday, October 12th, 2007

Google punishes sites heavily for 404 errors. By the time you realize your site has an error it’s usually too late and you’re already being punished for it. I suggest you stay proactive about your 404 errors and use a link checker. I found an extremely useful tool for this: Xenu’s Link Sleuth. It crawls your entire site and checks every single internal and external link. You can choose to ignore external links if you want and just focus on internal ones. It even visits images, mailto links, and just about anything that has a ‘src’ or ‘href’ in the HTML of your site. I considered it a nice toy when I first used it, but when it quickly found numerous serious 404 issues on a few of my sites I upgraded the importance of this tool in my toolbox. It will keep you ahead of the curve instead of constantly playing catch-up. Try it out on your site and I guarantee you’ll find 404s you had no idea existed. It also shows you all the pages that contain the bad links so you know where to go to correct the problem. Best of all, this software is free! It’s a Windows application, unfortunately, but any self-respecting web developer has virtual machines with different operating systems on them, Windows being one of them, so that shouldn’t be much of a problem if you’re a serious developer.
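To give a rough idea of what a link checker collects, here is a minimal Ruby sketch that pulls every ‘href’ and ‘src’ value out of a page and separates internal links from external ones. It is not how Xenu works internally; the method names, the sample markup, and the host www.example.com are all made up for illustration, and a real checker would go on to request each URL and record the status code.

```ruby
require "uri"

# Collect the value of every href/src attribute in an HTML string.
# (Regex-based for brevity; a real crawler would use an HTML parser.)
def extract_links(html)
  html.scan(/\b(?:href|src)\s*=\s*["']([^"']+)["']/i).flatten
end

# A link is internal if it is relative or points at our own host.
# Links with a scheme but no host (mailto:) count as external here.
def internal?(link, host)
  uri = URI.parse(link)
  uri.host.nil? ? uri.scheme.nil? : uri.host == host
end

# Hypothetical sample page
page = <<~HTML
  <a href="/about">About</a>
  <img src='http://other.org/logo.png'>
  <a href="mailto:me@example.com">Mail</a>
  <a href="http://www.example.com/contact">Contact</a>
HTML

links    = extract_links(page)
internal = links.select { |link| internal?(link, "www.example.com") }
external = links - internal
```

Ignoring external links, as mentioned above, then just means checking only the `internal` list.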

Rails form select integer drop-down helper method

Friday, October 12th, 2007

I’ve often come across situations while developing Rails apps where I just want a simple integer drop-down box. The default Rails helpers for selects and their options aren’t really geared for something simple like that. I don’t want to have to create a collection of integers and pass it into blocks or use any other ridiculous workaround in my views. I want them clean and simple. I created a helper function that lets you easily create integer drop-downs. Just toss this in your application_helper.rb.

And in your view simply call:

And you have yourself an integer dropdown from 1-20. I tried to make the options and select formatting and id/name conventions the same as the rest of the Rails select/option helper methods to keep things consistent.
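As a rough illustration of the idea, here is a minimal sketch of such a helper in plain Ruby with no Rails dependencies. The name integer_select, its arguments, and the :selected and :include_blank options are assumptions made for this example, not the original helper; it only imitates the object[method] naming convention of the Rails form helpers.

```ruby
# Hypothetical helper: builds a <select> of integers for object[method].
# The name, signature, and option keys are illustrative assumptions.
# No HTML escaping is done; object and method are expected to be plain
# identifiers such as :order and :quantity.
def integer_select(object, method, range, options = {})
  selected = options[:selected]
  choices = range.map do |i|
    flag = i == selected ? ' selected="selected"' : ""
    %(<option value="#{i}"#{flag}>#{i}</option>)
  end
  # Optional empty first entry, similar to Rails' :include_blank
  choices.unshift(%(<option value=""></option>)) if options[:include_blank]
  %(<select id="#{object}_#{method}" name="#{object}[#{method}]">#{choices.join}</select>)
end

# An integer drop-down from 1 to 20 with 3 preselected
html = integer_select(:order, :quantity, 1..20, selected: 3)
```

In a Rails view the returned string would be output directly; in newer Rails versions it would also have to be marked as HTML-safe.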

The future of search

Tuesday, October 2nd, 2007

I’ve been ranting and raving lately about how Google’s search sucks. There are numerous reasons, but let’s just focus on the relevancy of the results for this rant. Anyone using Google lately has seen the spammy websites that come up in search results. By spammy I mean those sites which are nothing more than screen scrapers, web directories, Google AdWords pages that use search results to generate static pages with scraped content mixed with more AdWords, and on and on. Most of these junk sites have tons of Google AdWords all over them, so of course why would Google care if they are ranked #1? They don’t, and that’s precisely the problem.

I’m rambling on about Google because anyone with a profitable website knows that Google is your primary traffic driver (most of the time, of course). Google used to weight external links to your site very heavily. As a result, people started creating link farms and easily got around that. External links still count, of course, but more for going from one tier to the next in their ranking scheme. Yes, there are multiple tiers. Relate that to primary and secondary indexes and you’ll know what I mean. Since everyone realized how easy it was to fool Google with external links to your site, they altered their algorithm ever so slightly over the years to make internal linking much more important. That’s why you see all these junk sites nowadays. There is a very straightforward way to create a site with a good internal linking structure. Think of tags, relevant tags, and similar concepts, along with your traditional site-hierarchy linking structure, as the way to create a well-connected internal linking structure. Google will eat this up, and the junk sites that employ this sort of design are proof that they rank internal linking much higher than external links to your site.

Now what’s one to do about all this mess? People and businesses are always going to find ways to get ranked high in search engines. It’s the name of the game in online commerce. As a result, there will always be junk sites like the millions that Google is indirectly creating (because their algorithm favors them). The solution that I see is a combination of ideas that already exist in one form or another. Search engines need to learn who I am and what I mean when I use certain words. For instance, searching for the word ‘rails’ might mean I’m looking for trains or it might mean I’m learning about Ruby on Rails. A good search engine of the future would learn from my search behavior and somehow be able to pick the context out of the words I’m using. It needs to learn what sort of sites I favor over others. I hate Google AdWords junk sites yet I get them all the time. This sort of site structure, along with its abundant links to Google’s JavaScript for AdWords, could easily be understood as something I would rather not see. Learning will be key to the future of search.

I mentioned understanding the context of my words without me providing context (rails). That implies that search engines will need to figure out some type of semantic meaning from pages beyond just the words and what words are near them. That’s a problem that some are already attempting to solve. It’s a huge scalability problem, though, since parsing semantic meaning takes much longer than the simple dumb indexing of words that Google does. The future of search will definitely include semantic meaning, whether it’s just a more sophisticated word indexing that effectively achieves semantic understanding or one that truly parses sentences for parts of speech and such. Combine that with a little machine learning and you have yourself a pretty good search.

Finally, some suggest that social bookmarking and rating sites such as Reddit are the future of search. I disagree. Mob rule is never good. However, if a search engine were to create a hidden set of like-minded individuals for me (based on who means what with their search terms), it could get a better understanding of who I am and what I mean when I say lisp. Then again, what happens when I’ve been a geek all my life and I suddenly have a kid who has a lisp? Will it always be up to the user to figure out how to find their results? Will businesses and individuals always be able to ruin search engines with junk sites that have figured out the algorithm? So far that’s the case. A little learning and a little semantic understanding should do the trick, though.

Problems with non-english characters in urls

Thursday, September 27th, 2007

I have a site that used non-English characters in the URL. They were basically Spanish characters in the names of things. Some had little accent marks on them and such. Anyway, everything worked fine in my browser using those characters in the URL. My browser sees a link with the strange character and escapes it to something like %3d. Great. The problem, though, is that if I change my default character encoding to, say, Traditional Chinese, that same character gets escaped into something completely different like %8f. That’s no good, because when visitors try that URL it doesn’t always go to the same page. Why? I’m not entirely sure, but I suspect it’s Apache or Rails translating that URL using a certain character encoding.
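The effect is easy to reproduce outside a browser. The sketch below percent-encodes the bytes of one accented character under two different encodings; the helper name percent_encode_bytes is made up for this example, and UTF-8 and ISO-8859-1 are simply two convenient encodings to compare, not necessarily the ones involved on my site.

```ruby
# Percent-encode every byte of a string, the way URL escaping works
# on the raw bytes of whatever encoding is in effect.
def percent_encode_bytes(str)
  str.bytes.map { |b| format("%%%02X", b) }.join
end

char = "\u00E9" # e with an acute accent

utf8   = percent_encode_bytes(char.encode("UTF-8"))      # two bytes
latin1 = percent_encode_bytes(char.encode("ISO-8859-1")) # one byte

puts utf8   # %C3%A9
puts latin1 # %E9
```

Same character, two different URLs. Whatever sits on the server side has to guess which encoding was used before it can map the request back to a page.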

Logic would tell you that I should just put the character encoding in the html headers right? Yes, that works in theory. Everything works in theory though. In practice, not every browser or spider actually listens to that. I mentioned spiders because some spiders will automatically assume a particular character encoding and do the same thing as a browser with the default character encoding set. What to do? What to do?

My solution was to just get rid of all non-English characters. No one is accidentally escaping an ‘a’ to %ef. So far the solution is working out fine. I don’t entirely like the URLs now, but it’s better than having characters escaped improperly by browsers and spiders.
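One way to do that kind of cleanup in Ruby is sketched below. It is only an illustration of the approach, not the code I used: the method name ascii_slug and the sample title are made up, and it relies on String#unicode_normalize, which needs Ruby 2.2 or later.

```ruby
# Reduce a title to plain ASCII so it can be used safely in a URL.
def ascii_slug(text)
  text.unicode_normalize(:nfd)   # split accented letters into letter + combining mark
      .gsub(/\p{Mn}/, "")         # drop the combining marks
      .gsub(/[^\x00-\x7F]/, "")   # drop anything still outside ASCII
      .downcase
      .gsub(/[^a-z0-9]+/, "-")    # collapse everything else into hyphens
      .gsub(/\A-+|-+\z/, "")      # trim leading/trailing hyphens
end

slug = ascii_slug("Cami\u00F3n Espa\u00F1ol")
puts slug # camion-espanol
```

The accents are lost, which is why the resulting URLs are not as pretty, but every character that remains escapes the same way under any encoding.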

one-to-many associations made easy with ActiveScaffold

Friday, September 7th, 2007

I just dove in and started using ActiveScaffold for a new project. There was a little learning curve since I was doing that along with using RESTful Rails. I just started with a simple one-to-many association. I set up my 2 models as usual with has_many :ads and belongs_to :affiliate. Then I created two controllers that just had something like:

And finally added this to my routes:

When I went to http://localhost:3000/affiliate I was just amazed to see it actually worked. It let me create my affiliate and add ads to that affiliate on the fly. All my CRUD operations were already done without having to manually link them in the controller like I had been doing in previous projects. I’m not sure how well it’s going to scale with the project in the long run, but it certainly is an improvement over the traditional Rails scaffolding and I highly recommend giving it a try.

My beef with the Google god

Friday, September 7th, 2007

Google called me out of the blue the other day asking if I wanted a job. It sounded like a good idea at first so I followed through with my updated resume and such. At some point they said they had 3 separate positions for me and that I should try to get in their core team first. I had a brief interview with the core team where they judged my qualifications based on 3 questions. I don’t remember what they were but I answered them all wrong. Well, the second one I didn’t even try and just said I don’t know because I was pissed that they were giving me a pop quiz and I got the first one wrong. No googling. After that interview it took less than 1 minute to find the answers. I never responded to the other requests from them because I don’t really want to work for a company that is so full of itself that it honestly thinks pop quizzes are the best way to weed people out. Pass. Good luck though, Google. Now, on to the real meat of this post.

Google’s index seems like it’s continuously updated. That would be great for search if there weren’t so many junk results.

  • Internal linking - Google loves internal linking which is one reason why there are so many junk results in their search. From my experience, tossing in link dumps all over your site actually helps it do better. The result? Everyone link dumps and gets better rankings so you get a bunch of crappy search results.
  • Sensitivity - Forget search for now; let’s focus on Google from a developer’s perspective. Google continuously updates their index. Great. What does that mean for a developer? It means that if you forget an apostrophe on an anchor tag you end up with dozens of 404s. No big deal if you catch the mistake early, right? Wrong. What happens is the missing apostrophe bleeds the link into whatever follows, causing invalid links. If the Google god happens to see your mistake they will try adding those invalid links to their index. They won’t be valid so you will be penalized for having 404s on your site. That’s a surefire way to see your rankings drop off the map for some ridiculous mistake that was corrected a few hours after it was made.
  • Poor tools - Luckily Google provides you with a way to remove invalid links from their index, but good luck using that thing. Let’s say that one mistake created 50 404s across your site. You have to copy and paste each 404 URL into the URL removal form, one at a time. Not only that, but you have to remove the domain name from the pasted version. So it’s copy, paste, edit, copy, paste, edit, 50 times in a row. Yay! Or you can copy and paste them into a text editor, global replace the domain, and then copy and paste them into the removal form one by one.
  • Poor responsiveness - OK, so what? At least they provide a way to remove your URLs from their index instead of waiting around for weeks, right? Well, kind of. It’s not the same continuous updating that they do themselves. They’ll eventually listen to your request, but only on their time. When they’re good and ready. I’ve had a request pending removal for over 2 weeks. That’s 2 weeks of being penalized for a missing apostrophe that was live for less than 1 hour. Way to go, Google.
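The text-editor step from the “Poor tools” item can at least be scripted. Here is a small Ruby sketch that strips the domain from a list of URLs; the URLs are made up for illustration, and it does nothing about having to submit each result to the removal form by hand.

```ruby
require "uri"

# Hypothetical list of broken URLs copied out of a 404 report
broken = [
  "http://www.example.com/products/old-page",
  "http://www.example.com/about?ref=home"
]

# Keep only the path and query string, dropping scheme and domain
paths = broken.map { |url| URI.parse(url).request_uri }

puts paths
```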

This wouldn’t be complete without some suggestions. First, get a clue about hiring good developers. You’re going to eventually end up with a crap gene pool like Microsoft and Yahoo and be usurped by my new search engine. Next, don’t be so harsh on the occasional 404 and at least provide a quick way to remove them. Your index is updated continuously; if you want feedback then use it. Don’t sit on my feedback for weeks. Next, give your web developer tools some love. They’re so primitive, with little thought put into usability. Also, internal linking shouldn’t count for nearly as much as you give it credit for. Look at the sites coming out these days. Huge link dumps that people pass right over. You’re forcing the creation of millions of junk sites on the internet. That’s not a good thing. And finally, your search obviously uses some type of machine learning and appears to be in a rut. People have your search figured out and are taking advantage of that. You need a smarter machine learning algorithm.

class_table_inheritance with acts_as_taggable

Friday, August 24th, 2007

If I have:

The problem is that:

returns all the Products tagged with ‘test’ rather than all Subproducts. So I tried:

but that does the same thing. The problem boils down to the find_tagged_with! method using acts_as_taggable_options[:taggable_type], which is defined as class_name_of_active_record_descendant elsewhere in the acts_as_taggable plugin. The solution is to rewrite find_tagged_with! to:

This should work even for classes that aren’t using class table inheritance since it’ll just use the class name.
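To show why the two lookups give different answers, here is a plain-Ruby sketch with stand-in classes. No ActiveRecord or plugin code is involved: Base and base_descendant_name are hypothetical names that only mimic the behavior described above, resolving to the class directly below the base versus using the class’s own name.

```ruby
class Base; end                  # stand-in for ActiveRecord::Base
class Product < Base; end
class Subproduct < Product; end

# Walk up the hierarchy until we reach the class directly below Base.
# (Assumes klass really is a descendant of Base.)
def base_descendant_name(klass)
  klass = klass.superclass until klass.superclass == Base
  klass.name
end

puts base_descendant_name(Subproduct) # Product
puts Subproduct.name                  # Subproduct
```

Filtering on the first value matches every Product, while filtering on the second matches only Subproducts, and a class that sits directly below the base gets the same value either way.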

Refs: http://wiki.rubyonrails.org/rails/pages/ActsAsTaggablePluginHowto

Class table inheritance problems

Thursday, August 23rd, 2007

I’ve successfully implemented class table inheritance but came across a problem in one instance. I’m using it to simplify my database structure. One of the models I’m using it for has no additional properties other than the stuff it’s inheriting. I’m doing this to keep everything consistent with the architecture. For instance:

So anvils just inherits all the data from products and doesn’t really add to it. The problem occurs when trying to save a modified anvil record. I end up with an error in the SQL because it’s trying to do something like:

But there aren’t any local attributes to update on the anvils table. I could redo that one structure to not use class table inheritance but that would cause more of a headache than it’s worth. I could try to hack a solution using the class table inheritance plugin but that didn’t sound fun either. The easiest solution, and I’m not really proud to admit this, was to simply add a junk tinyint(1) column to the anvils table. That way when the update SQL runs it won’t cause a syntax error.

I was looking for a way to do anvil.parent.save or something similar but couldn’t find a way to do that. That’d be a much better solution so if anyone figures that out please let me know.
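For anyone wondering why an empty attribute list breaks the statement, here is a plain-Ruby sketch of the failure, together with a guard that skips the UPDATE when there is nothing to set. This is not the plugin’s code and not the fix I used (that was the junk column); the method names and the quoting helper are made up for illustration.

```ruby
# Very small quoting helper, for illustration only
def quote(val)
  val.is_a?(Numeric) ? val.to_s : "'#{val.to_s.gsub("'", "''")}'"
end

# Naive builder: always emits a SET clause, even with nothing to set
def naive_update_sql(table, attrs, id)
  assignments = attrs.map { |col, val| "#{col} = #{quote(val)}" }.join(", ")
  "UPDATE #{table} SET #{assignments} WHERE id = #{Integer(id)}"
end

# Guarded builder: returns nil when the table has no local attributes
def guarded_update_sql(table, attrs, id)
  return nil if attrs.empty?
  naive_update_sql(table, attrs, id)
end

broken  = naive_update_sql("anvils", {}, 7)
skipped = guarded_update_sql("anvils", {}, 7)
parent  = guarded_update_sql("products", { "name" => "Acme" }, 7)

puts broken # UPDATE anvils SET  WHERE id = 7
puts parent # UPDATE products SET name = 'Acme' WHERE id = 7
```

The empty SET clause in the first statement is what the database rejects; the junk column simply guarantees there is always something to put there.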

Online uml and er diagramming tool

Wednesday, August 15th, 2007

I’ve been searching for a free UML diagramming tool for Linux for a long time. There are plenty of options but nothing really stable or good enough for what I need it for. I stumbled on Gliffy today. It’s an online diagramming tool written in Flash. Best of all, you can use it for free given some minor limitations. I tried it out and it’s exactly what I’ve been looking for to help collaboration between remote developers. I definitely suggest you check it out.

Computational justification for the use of meta descriptions and keywords

Sunday, August 5th, 2007

Search engines have a lot of work to do crawling the web constantly. There must be a lot of computational power required to constantly parse HTML pages and assign rankings for the enormous number of sites now out there. As such, it makes perfect sense for a search engine to want to speed up that process in any way it can. The use of meta descriptions and meta keywords helps search engines speed up their algorithms by not having to parse your entire page. The engine just has to read the header information and it can move on.

The problem is that people realize this so they do a bit of keyword stuffing to try and give themselves a boost. Search engines don’t simply ignore your page when you use keywords and descriptions. They just don’t parse the entire page as often if your meta keywords and meta descriptions match the content on your page. If they don’t match, of course your site will require more processing because they have to parse the entire page and not just trust your keywords and descriptions.

The use of meta tags saves search engines tons of time. Since you do them a favor, they do you a favor and you get higher rankings.