Making Your Own “锟斤拷”

锟斤拷

A programmer joke: “Wielding a 锟斤拷 in each hand, shouting 烫烫烫.”

“锟斤拷” appears when the same byte sequence is interpreted one way under Unicode (encoded as UTF-8) and another way under GBK.


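Unicode reserves a placeholder, \uFFFD (the replacement character), to stand in for characters that cannot be represented in Unicode or cannot be decoded correctly. Encoded as UTF-8, \uFFFD becomes \xef\xbf\xbd, which takes 3 bytes.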
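To make the three bytes less mysterious, here is a minimal Python sketch that packs the bits of U+FFFD into the three-byte UTF-8 layout by hand and compares the result with the built-in encoder. The helper name `utf8_three_byte` is made up for illustration, and it deliberately handles only the three-byte range (it does not reject surrogates).

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Manually encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    first = 0b11100000 | (code_point >> 12)           # 1110xxxx: top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    third = 0b10000000 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([first, second, third])


manual = utf8_three_byte(0xFFFD)
builtin = "\ufffd".encode("utf-8")
print(manual.hex(), builtin.hex())  # efbfbd efbfbd
```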

When \xef\xbf\xbd appears two or more times in a row, for example \xef\xbf\xbd\xef\xbf\xbd, the bytes become ambiguous:

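In UTF-8, where this character takes 3 bytes, the sequence is decoded correctly in groups of 3 bytes as the code points [FFFD] [FFFD], displayed as [�] [�].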

In GBK/GB18030, which reads these bytes 2 at a time, the sequence is decoded in groups of 2 bytes as [EFBF] [BDEF] [BFBD], displayed as [锟] [斤] [拷].
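The two groupings above can be sketched in Python by slicing the same six bytes into chunks of 3 and chunks of 2 and decoding each chunk separately. The variable names are illustrative only.

```python
# Six bytes: ef bf bd ef bf bd
data = "\ufffd".encode("utf-8") * 2

# Slice the same bytes into groups of 3 (UTF-8 view) and groups of 2 (GBK view).
utf8_groups = [data[i:i + 3] for i in range(0, len(data), 3)]
gbk_groups = [data[i:i + 2] for i in range(0, len(data), 2)]

utf8_chars = [group.decode("utf-8") for group in utf8_groups]
gbk_chars = [group.decode("gbk") for group in gbk_groups]

for group, char in zip(utf8_groups, utf8_chars):
    print("UTF-8", group.hex(), char)
for group, char in zip(gbk_groups, gbk_chars):
    print("GBK  ", group.hex(), char)
```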


Modern operating systems and browsers detect character encodings, so 锟斤拷 is nearly extinct. If you want to see one, you have to make it yourself.

Take Python as an example.

First, encode the Unicode placeholder as UTF-8 and repeat the resulting bytes twice.

data = u'\uFFFD'.encode('utf8') * 2

Then decode the bytes explicitly as GBK.

print(data.decode('gbk'))

And there you have a brand-new 锟斤拷!
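For context, here is a hedged sketch of one common way 锟斤拷 shows up without anyone typing \uFFFD on purpose: GBK bytes get decoded as UTF-8 with replacement enabled, the result is saved as UTF-8, and the saved bytes are later read as GBK. The sample text "中文" and the variable names are assumptions chosen for illustration, and real pipelines may differ.

```python
# Bytes written by a GBK-based system.
original = "中文".encode("gbk")

# Wrong codec: each undecodable sequence is replaced with U+FFFD.
misread = original.decode("utf-8", errors="replace")

# Saving the misread text turns every U+FFFD into the bytes ef bf bd.
resaved = misread.encode("utf-8")

# Reading those bytes back as GBK produces the familiar garbage.
garbled = resaved.decode("gbk")
print(garbled)
```

The original text cannot be recovered at this point, because every invalid sequence was collapsed into the same replacement character.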

Bypassing ISP Incoming Traffic Restriction on Port 80

A couple of days ago, my friend Jeffery registered a new domain name and pointed it to a server in his apartment. But his ISP, Cox Communications, blocks some common ports, namely 80 and 443. With these ports blocked, one cannot easily run an Internet-accessible website. As a result, Jeffery had to append the port number to the URL explicitly, which is not only a nuisance but also hurts SEO. So he came to see if I could help him out.


URL Shortening

Two days ago, I had something I wanted to share with my friends, so I uploaded it to my personal file-sharing server. But the sharing URL was long, about 60 characters. I thought that a shorter, easier-to-remember URL would probably bring more people to the shared file. None of the existing URL shortening services met my needs, so I ended up making my own URL shortening system.


OPT Application Status Estimation

TL;DR

Use AJAX to send queries to a proxy page that scrapes the USCIS website. By checking a range of application receipt numbers, we can estimate the application processing time. USCIS actually took 92 days to process my application, finishing on exactly the day I had predicted.
