Problems with non-english characters in urls

I have a site that used non-english characters in the url. They were basically characters in Spanish for the names of things. Some had little accent marks on them and such. Anyway, everything worked fine in my browser using those characters in the url. My browser sees a link with the strange character and it escapes it to something like %3d. Great. The problem though, is that if I change my default character encoding to say Traditional Chinese that same character gets escaped into something completely different like %8f. That’s no good because when they try to visit that url it doesn’t always go to the same page. Why? I’m not entirely sure but I suspect its Apache or Rails translating that url using a certain character encoding.

Logic would tell you that I should just put the character encoding in the html headers right? Yes, that works in theory. Everything works in theory though. In practice, not every browser or spider actually listens to that. I mentioned spiders because some spiders will automatically assume a particular character encoding and do the same thing as a browser with the default character encoding set. What to do? What to do?

My solution was to just get rid of all non english characters. No one is accidentally escaping an ‘a’ to %ef. So far the solution is working out fine. I don’t entirely like the urls now but its better than having characters being escaped improperly by browsers and spiders.

Leave a Reply