On natural language processing

Wouter was wondering about natural language processing. I have become quite interested in that field, although I also lack any real knowledge of it beyond a couple of fairly simple articles I have read and talks I have attended. A great resource on this is Alexander Gelbukh - I saw him give a talk at the 30th CLEI conference in Arequipa, Perú. He has some quite interesting articles about NLP on his web site, although they are in Spanish (anyway, for anybody interested: Avances en análisis automático de textos and Tendencias recientes en el procesamiento de lenguaje natural), but browse around, there are still many good links. The basic idea of the two Spanish articles is that NLP goes through the same basic steps as a formal language compiler (i.e., lexing, parsing, semantic analysis) - The main difference is that any sentence in a natural language has many implicit relations with a universe of knowledge around it, so you cannot just build a parse tree for each of the sentences - You must have a universe of concepts and fit into it the parse tree of each sentence in the text you analyzed. Of course, in order to do so, you must also resolve the ambiguities that are so common in spoken language, but that's a whole other topic. Gelbukh's work is, AFAICT, geared towards data mining - performing automatic analysis of many texts and coming up with conclusions that are not explicitly stated in any of them, probably with mechanisms to trace back which pieces of information led the system to each conclusion. As I told you, I really like this topic, and I intend to dive deeper into it as soon as I get out of some obligations... But I'm sure Gelbukh's page will make for interesting reading. 
Another project I really enjoyed (completely unrelated to what I wrote above, since its realm lies much closer to the bottom, near the lexical/grammatical analysis phases) is Snowball, a free language for writing stemming algorithms, which has stemmers implemented for many European languages. The Snowball site also has a very nice article on what stemming is, how it works, and how it has evolved over time.
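To give a rough feeling of what a stemmer does, here is a minimal sketch in plain sh. It is emphatically not Snowball and not any of its real algorithms: the function name toy_stem, the suffix list and the three-letter minimum are all assumptions of mine, picked only for illustration. It just strips the first matching suffix so that related word forms collapse into one stem.

```sh
#!/bin/sh
# Toy suffix-stripping stemmer. NOT Snowball: the suffix list and the
# minimum stem length are hypothetical, hand-picked for this illustration.
toy_stem() {
    word=$1
    for suffix in ing edly ed ly es s; do
        # Remove the suffix from the end of the word, if it is there.
        stem=${word%"$suffix"}
        # Accept the result only if something was removed and at least
        # three characters remain, so that "is" does not become "i".
        if [ "$stem" != "$word" ] && [ "${#stem}" -ge 3 ]; then
            printf '%s\n' "$stem"
            return 0
        fi
    done
    # No suffix applied: the word is its own stem.
    printf '%s\n' "$word"
}

# Show how a few related forms collapse into the same stem.
for w in linking linked links link; do
    printf '%s -> %s\n' "$w" "$(toy_stem "$w")"
done
```

All four sample words end up as link. A real stemmer needs far more careful rules than this (just try it with parse and parsing), which is precisely what the Snowball article goes into.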
I am complete! / Debconf stuff

Yesterday I got my X-ray results - I am fortunately complete, with no apparent injuries to my bones or discs. I got some pretty simple advice from an aunt who is a Feldenkrais instructor, and yes, it still hurts and I still look funny when standing up/sitting down, but it is getting better. Sadly, I still haven't received the medical insurance papers from UNAM, so those MX$2300 for a complete set of X-rays came out of my pocket. I'll see what I can do to pay some other things I have pending :-/ Anyway... I'll manage. [code="sh"]for i in $(seq 1 100); do echo "I should not inspire pity"; done[/code]
Yesterday was a productive day. I put the finishing touches on Debconf registration, using a branch off Comas - There are still details to fix, so the registration is still not on the official server, but I hope to have it there soon. Why am I blogging about it and getting more people to whine about why it isn't ready yet? Because, at best, that will put more pressure on me to have it ready soon ;-) Guys, see you in HEL! Yesterday I also started asking my Debianmexico friends for suggestions on where to hold Debconf6 - Oaxtepec looks like the winner, but there are many other possible choices (and many still unexplored). Strangely, most of them seem to be IMSS-owned. I hope to visit those places soon, I'll keep you posted.
Rodrigo, you should know that I take you as one of the prime examples of nerdiness in Mexico. I am amazed you think the same about me :)
Jesús: I don't understand how you can believe that Tlalnepantla is better than ${place}. I worked in Tlalnepantla for some four years - It is just like a little town inside Mexico City, but without the beauty. It is one of the places I'd gladly erase from my memory. Good luck in Guadalajara!
Meme, meme, meme time! / My language of mine / Driving++

I am 87% loser. What about you? Click here to find out! I am nerdier than 94% of all people. Are you nerdier? Click here to find out! What is your weird quotient? Click to find out! So it turns out I am a big loser, a VERY big nerd and quite a weird guy. Does not really surprise me - Well, I feel happy about being that nerdy (and, yes, I answered with the truth) - How the hell did Steve, Alexander, Per-Arne and Martin make it? My hat off to you guys! (OTOH, it should not surprise me having that many über-nerds in Debian ;-) )
Isaac blogged about the lack of a way to avoid possessives in English, and he says he thinks Spanish has the same problem - In theory, yes. In practice, it is even worse. It is very common to hear people say, for example, su mamá de él (his mother of him) instead of simply su mamá (his mother), as su is ambiguous between the formal second person and the third person. Yes, that's a mistake and is not tolerated in educated circles - but nevertheless, it is very common.
My [term]coccyx[/term] still hurts badly since last Wednesday's fall. Yesterday I went to have some X-rays taken of my whole spine (at US$200, they'd better be worth something!), and I'll take them to a doctor today. But at least something good came out of it: I was in no shape to drive there, so Nadezhda took the car for the first time into the wilderness of Mexico City. Congratulations, Cosa! :-D She was so happy and confident she even gave her sister and nieces, who dropped by to visit us, a ride home afterwards.
[friend]ion[/friend]: You are insane - But your ideas rock!
Forget your keys

Amaya's post made me remember one of the most stupid, boring, frustrating days of my existence. Yours, at least, doesn't sound _that_ bad. About two years ago, one Saturday morning, the Debianmexico crowd scheduled its first meeting - 10AM, some 20 minutes away from my home. As the only DD in Mexico, it was my task to prepare the material for the meeting. What was I talking about? A simple introduction on making .debs. By then, we were renting the lower half of a small house in San Pedro de los Pinos. The house split was strange: We entered through the street door to a very little garden (about 4 square meters) and a little room (about 5 square meters); to the right there were the stairs to our neighbors' half, to the left there was our apartment door. About 8:30, Nadezhda left - I don't remember what she went to, some course about something... But at about 8:45 the doorbell rang - I thought it was her, forgetting something. I put on my pants and went to open the door. Just after I closed my apartment door, I realized the keys were on the table. And the person outside was not Nadezhda. It was just someone passing by. Well, to make things short, I was stuck. I managed to open the house's window, but there were security bars, and I could not get in. My keys were four meters away from my hands. I was shoeless, moneyless... A neighbor kindly tried to help, but with no luck. Darn... I would have easily traded my five cats for a single chimpanzee able to understand and give me the keys. I was late for my talk - no, wait, I didn't get there at all. And I told Nadezhda we'd meet at her office - She was waiting for me there until she got pissed. Around 18:00 she got home. I spent one of the most stupid, worthless days of my life waiting for her to appear, waiting for my cats to give me the keys.
On my way to work, I went upstairs to grab my glasses. As I was coming down, Tin Tan was drinking water from his dish on the middle step of the stairs (you know, that wider step where the stairs change direction). I decided not to bother him and skipped that step. Next thing I know, Tin Tan is running upstairs, quite scared, and I am heading downstairs quite a bit faster than I intended. I shouldn't skip over two steps while wearing my slippers - They _do_ slip. I now have trouble with every movement. What bothers me most is that sitting in front of a computer is almost as painful as not using a computer for a couple of days. I hope the pain just goes away soon.
I really liked Luciano's suggestion for my post (regarding John Goerzen's): You can even use Festival to read your day's to-do list out loud first thing in the morning!
Missing appliances

John Goerzen comments on his quest for a good alarm clock. You want a good, geeky alarm. Whenever I am away from home, I always count on [code="sh"]echo $LOUD_NOISE_CMD | at $WAKEUP_TIME[/code] It always works. Be it in a hotel room with my trusty old laptop, or at home with your powerful server connected to the stereo system and your favorite punk rock music, it is guaranteed to work - and, as you request, with a nice form factor for your computer, it is as geeky as it gets. ...But you got me thinking about this: I am quite frustrated. I wanted to buy a telephone answering machine. Nothing fancy, I just needed to replace the one I had for many years, which died some months ago. I cannot believe this: I went to at least five different stores with an electronics department. I also went to Radio Shack and Steren. I just cannot find any answering machine without a wireless phone built in. Why don't I want a wireless phone? Because I need a ~US$30 thing, not a ~US$100 one. And, sadly, I need a bit more equipment (and work) to get my computer to function as such a machine than I need for it to become an alarm clock.
Back to work / News that make you shout in anger

Ok, so -effective yesterday, January 6, as we were on vacation until that precise day- I am finally hired at IIEc-UNAM (the Economics Research Institute at Mexico's National Autonomous University). Life looks rosy and beautiful. Being an academic worker at a big university makes you... ...Do paperwork. I spent two days preparing my workplan for this year. Then, my boss told me I had used the wrong format, and that I didn't need to include the justification, only the points I intend to cover. Ok... Well, it is done now - But, as I have already worked at UNAM, I know this is only the first of many, many papers I will push around in the coming years. Fortunately, I was able to do some real work as well. IIEc really surprised me - I was hired mainly as a sysadmin - But there are currently no services in the institute. The mail accounts are handled externally. Even the Web page is on an external server. Some groups have started setting up their own servers - Well, the first point in my workplan is to restructure the Institute's servers - Provide here all the services that are currently provided at DGSCA, and consolidate the different services offered by different groups into the servers under my control. And just today I stumbled upon a group that was just requesting the purchase of a server for their database, explained to them the benefits of having a single administration, and convinced them to set up their services on my server... I hope this gets me at least some extra RAM or speed for the server _I_ want to buy ;-)
On a very different topic: I must express my regret and anger. Reading my favorite newspaper, I see (and in the back cover, no less!) that after 15 years of work, the Mexican Simpsons dubbing team will be fired because Grabaciones y Doblajes Internacionales, one of Mexico's main dubbing companies, refuses to hire people who have joined the ANDA (Asociación Nacional de Actores, National Actors Association) union. This is an illegal measure. And it will destroy one of the finest dubbing works that we have. I really fear the result. Yes, I am a Simpsons junkie. This problem really saddens me. Sue me.
The most transparent region

In 1917, Alfonso Reyes (more info in Spanish) started his best-known poem, Visión de Anáhuac (WTF... Cannot find a single online copy of the poem?!) with the following words: Viajero: has llegado a la región más transparente del aire (Traveller: You have arrived at the most transparent region of the air). This poem, of course, referred to the breathtaking view of the Anáhuac Valley, in which Mexico City grew. Yesterday I went with Nadezhda to my father's house in Cuernavaca. This morning, as we came back (it feels quite strange to be on the road on January 1st at 8AM :) ), we felt Mexico City was more polluted than normal. Much more. Even more than on the worst days of 1989. We got home - It smelled like burnt wood or something like that. Nadezhda was scared and quickly went to check whether the house was still complete - Fortunately, it was. But it turns out that so many people had set off firecrackers and lit bonfires to greet the new year that even the air inside my house was foggy. Yes, we had quite an obvious thermal inversion, as from the Southern hills the view of the Eastern and Northern hills was quite decent... But this was way over the line! People, especially Mexicans, especially [term]chilango[/term]s: Please, be more mindful! This was quite a frightening sight!
I hate deb-slashdot...

I wanted to create a little new meme. I posted a world map with visited countries. It was nice. It was good. The meme started to spread... And then the server choked after a deb-slashdotting :-(
Proxy Error The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET /projects/visitedcountries/. Reason: Could not connect to remote machine: Connection refused
What can I say? Sorry, douweosinga.com :-(
Is it slow? Nah, it is just December

Almost two weeks, not a single post? Well, yes, life tends to slow down for the holidays. Although I have been mostly working, at least a couple of hours every day, I have managed to do very little these last weeks. There has, though, been some activity. I have sat down to hack on Comas quite a bit. [friend]Mig[/friend] has done his fair share as well. I am happy about it. And we are gathering more people in the project - Comas will be used for the Bolivian Free Software Conference, for a German Perl workshop... And it seems we are getting more hands. This is getting fun! :-D I was also busy with Nadezhda printing and selling some very nice Debian T-shirts, just like the ones I took to the last GULEV conference. Very nice T-shirts, I really liked the result, and I have sold them at a nice pace - even with the vacations and all. Now... While browsing around, I came across the very nice Visited Countries site. Of course, my ego did not let me walk away just like that... So here it is: ...Although it seems quite unfair that a visit to Montreal or a visit to Porto Alegre is good for such a huge impact area, this little map is something I have long wanted to do. Now, why doesn't he have one for the Mexican states? ;-) But nothing beats the world-map-on-a-corkboard-with-lots-of-tacks I want on my wall ;-)
WTF is this?

Ok... If you know me, you will be perplexed to find this as my current desktop. Yes, people who know me know that I dislike integrated desktop environments. I am a very happy WindowMaker user, and have been faithful to WMaker for at least seven years... But anyway, I have found myself recommending Linux to almost-average users... So I decided to force myself to use a computer as they would, for a couple of days at least. I am trying to get the whole user perspective, even the settings they would use (e.g., graphical smileys in Gaim... That's bad!), even using one of the pieces I most often shun in favor of the traditional terminal: The file manager. Well... Almost anything - Just don't take away either mutt or Emacs. So far, after some four hours of thinking as a user, I like Gnome. I do think some things should be different, but before screaming about them, I'll play with it some more. I had not used it since... The 1.4 days, I think. 2.8 has just entered Debian, and it is amazingly smooth. I plan to submit myself to at least two days of Gnome torture, then two days of KDE torture... If time allows, I'll even torture myself again with xfce, although I tried it in the past and never liked it :)
The stars above us

So today we have the Geminid meteor shower, right? Well, what could we do but print a sky chart of Mexico at 20:00 (actually, I printed it for 18:30, but was able to infer, more or less, the position of Gemini) and drive to our dear and nearby Ajusco? We got to a nice spot, with no light around us besides the road (which was quite bothersome, but bearable), and... Well, we had a nice view of the Southern and Western parts of the sky... But the East and the North were cloudy, and Mexico City was just North of us, so the city lights reflected off the clouds... So after some minutes, we headed back home. When we were mostly back in civilization, Nadezhda told me she was hungry. And you know how hard it is for me to please my woman when she wants food... Some days ago we were remembering a very good place to eat [term]pambazo[/term]s, [term]quesadilla[/term]s and such, very close to the center of Magdalena Contreras. Contreras is a beautiful (although mostly poor) area, struggling between its identity as a little town and its reality as part of a huge city. Nadezhda was born in Santa Teresa, right on the border between the towns uphill and the city in the valley, between the opulence down in Pedregal and the poverty towards Contreras, and she knows the area quite well. It was very nice to hear her talk about her childhood, stories about her father driving like crazy on those impossibly twisted streets, places she went to as a child... This little restaurant is on Álvaro Obregón street, and has not changed at all since she first took me there about eight years ago. She insists it has stayed identical since her father took her and her brothers there 25 years ago. We just had our dinner, walked a couple of blocks, and came back home. Yes, you might ask why I should blog about something as irrelevant as this... Well, the thing is, I really enjoyed the evening out :-) BTW: I thought I would never see a harder place to drive in than Sucre or Potosí, in Bolivia... Well... Contreras does not fall behind :)
Pascualina / EsMasPC

It seems that after yesterday's rant SEPOMEX decided to stop playing tricks on me, and they finally left at my home a final notice stating I could go to my area's post office (which is not in my area at all, there are at least two post offices much closer, but anyway) to pick up a package. Now... A final notice? Yes... I never got the first or second ones, and my package was about to be either sent back to the sender or discarded. Well, I went to the post office and got there a bit past 16:55 (it closes at 17:00 - and believe me, Mexicans are really punctual when it comes to going home after work). I had been expecting this package for a long time: Five Pascualinas I asked the good [friend]MAVE[/friend] (this guy) to send me for my nieces when I was in Chile! I want to open them, but of course, it is not up to me - The girls must do it. Thanks a lot, man! :-D Later that night, [friend]Arareko[/friend] came for a T-shirt I had promised to keep for him, and [friend]Kbrown[/friend] came to show me an EsMas PC, as I was quite curious about it. What is this EsMas PC? Well, first of all: EsMas (literally: ItIsMore) is the Internet name for Televisa, the largest commercial TV network in Mexico. This computer, which they sell for around US$250, is the first attempt I have seen at making the PC into a commodity - Clearly following the iMac's design, it is a fully integrated unit. Now, just like the original iMac, it is a very dated machine - 300MHz Celeron, 64MB RAM (of which 8MB are allocated to video). The interesting thing is that they ship the system with Linux - And not just any Linux, it is a Debian Woody system with KDE 3.2, Gnome 2.4, Mozilla 1.4, OpenOffice 1.1beta2, and some extra proprietary stuff (Netscape 7.1 IIRC, RealAudio player). It comes with a nice (although cheap-feeling) USB keyboard/touchpad that has exactly the size and layout of a laptop's. I love my laptop's keyboard, so I'd like to get hold of one of those - Except that it lacks many keys, it is very similar to the HappyHacking keyboard. 
It has only five rows of keys - That's right, no Esc, no function keys, and the only cursor keys are the arrows (no PgUp/PgDn/Home/End). Wait - It does have function keys... Only they masquerade as extra launcher buttons. They are not mapped correctly - Home is F1, Network is F2, and so forth. Silly. The offer seems quite good (although with limited hardware), and I asked Kbrown to lend me the machine for a couple of days just to test it, once again thinking about my nieces - they would really like having a computer like that. But I soon got disappointed. The machine is really slow. It would be much better if they cared to ship it with 128MB instead of just 64. We measured it, and right after starting up, Mozilla used some 25MB. Opening a page with a Java applet required the JVM - 30MB more. Add to this X and Metacity, and... Well, happy swapping. Oh, and don't even try to open OpenOffice as well. And if you do, make sure that's all you open - I opened some other programs... And the kernel decided to kill X as it ran out of memory. Kbrown has this machine because he wants to offer EsMas the quite amazing computadora.de service he is working on - He is no Debian user, so he came to me for help installing Firefox, hoping it would be lighter. Well... We upgraded the machine to Sarge. It took a couple of hours, but in the end it worked. We finished at 4:30 AM. The results? Well, nowadays Firefox is as resource-hungry as Mozilla. I just would not recommend this machine to anyone for any use. The machine is also not usable as a terminal for computadora.de. With 128MB the system would probably be quite usable for many more people, and with 256MB I would definitely recommend it. Well... Off to bed at 4:45 AM. Woke up at 7:45, as we had some things to do in the morning. For some reason we don't have running water at home today, so no shower for me. Back home, I was falling asleep. I had a cup of strong coffee, some Bolivian coca tea, and... Well, I am still longing for my morning shower :-(
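PS, for the record: the memory arithmetic behind the EsMas PC swapping, as a quick sketch. The figures are the rough ones quoted in the post (64MB total, 8MB for video, some 25MB for Mozilla, some 30MB for the JVM); the variable names are just mine, and nothing here accounts for X, Metacity or the kernel, which I did not measure.

```sh
#!/bin/sh
# Rough memory budget of the EsMas PC, in MB, using the approximate
# figures quoted in the post. Variable names are made up for this sketch.
total_ram=64   # installed RAM
video_ram=8    # allocated to video
mozilla=25     # Mozilla right after starting up
jvm=30         # JVM loaded for a page with a Java applet

# RAM the system can actually use once video takes its share.
usable=$((total_ram - video_ram))
# What remains for X, Metacity, the kernel and everything else.
left=$((usable - mozilla - jvm))

echo "usable=${usable}MB left=${left}MB"
```

That leaves about 1MB of the 56MB usable for everything else. Run the same sum with total_ram=128 and 65MB remain, which is why I think doubling the RAM would make such a difference.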
So Mexpost is...

Unbelievable. I have a package waiting to be sent to my mother, who lives in Sweden. I have put it off for a long time... After all, I have to get to a Mexpost office to send it. Mexpost defines itself as an accelerated courier service. It is part of SEPOMEX, Mexico's postal service. Traditionally, it has been the cheapest courier service in Mexico. I would not trust regular mail with five CDs and a book... Ok, so today I got to a Mexpost office. It took twenty minutes for the lady in charge to type the data into the computer (some ten lines of text - Of course, I had to help her write Förläggerevägen ;-) ). Only then did she tell me it costs something around MX$360 (some US$34) to send this 300 gram package. Shit, I don't have enough money on me. And, of course, this being a public office, she will not accept my bank card. From my office, minutes later, I call DHL. Yes, they will pick it up at my home. Yes, they assure me it will take only 2 working days to be delivered. Yes, they will charge me - MX$320. ...Now, why does SEPOMEX complain it is losing clients? Is it because of the higher prices, the lousier service, the difficulty of reaching their offices, or what?
MD5 to be considered dangerous?

Today I found a quite disturbing mail sent to Bugtraq, in which Dan Kaminsky briefly describes a way to generate more than one file with the same MD5 hash, and links to a paper explaining it further. And if that were not enough, Pavel Machek sent another mail demonstrating Kaminsky's claims with a little story about a scam. I checked the files attached to his mail, and yes, we have two similar (but different) files with the same MD5 hash: [code="bash"]
~$ md5sum /tmp/msg1 /tmp/msg2 ; diff --brief /tmp/msg1 /tmp/msg2
57ce330a6c6ca8e9ffab4f3b36b2a1a5  /tmp/msg1
57ce330a6c6ca8e9ffab4f3b36b2a1a5  /tmp/msg2
Files /tmp/msg1 and /tmp/msg2 differ
[/code] This attack is still not practical for real scams or impersonation. If we are signing files that will be processed by a computer (say, Debian packages, .tar.gz, ISO images, whatever), the colliding files will not be in a valid format to be installed. If, as in Machek's story, the files are to be human-parsed, there is too much cruft around the text for a human not to get suspicious. But anyway, this is a proof of concept, and it will surely be refined in the future... Every hash function has collisions, I know: it maps arbitrarily long inputs to a fixed-size digest. It must not, however, be feasible to generate them artificially with content of the attacker's choosing. I am not a cryptologist, nor do I claim I ever will be... But anyway, we will probably end up losing confidence in MD5 hashes, in favor of another hashing algorithm. Directly signing/verifying the whole file is not quite feasible, as asymmetric-key operations are just too heavy for such work. However, the installed base and trust that MD5 currently has will be challenged... Let's see what comes out of this.
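If you want to repeat the check on any pair of files, the idea can be wrapped up roughly like this. It is only a sketch: the function names is_md5_collision and check_pair are made up by me, it assumes md5sum (GNU coreutils) and cmp are installed, and the demo files it creates are ordinary files, not a real collision. For those, use the attachments from Machek's mail.

```sh
#!/bin/sh
# Succeeds only when two files share an MD5 digest but differ in content.
# Function names are hypothetical; md5sum and cmp are assumed to exist.
is_md5_collision() {
    sum1=$(md5sum < "$1" | cut -d' ' -f1)
    sum2=$(md5sum < "$2" | cut -d' ' -f1)
    # Same digest AND a byte-level difference means a collision.
    [ "$sum1" = "$sum2" ] && ! cmp -s "$1" "$2"
}

# Print a one-word verdict for a pair of files.
check_pair() {
    if is_md5_collision "$1" "$2"; then
        echo "collision"
    elif cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

# Demo on throwaway files (no real collision is included here).
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/a"
printf 'hello\n' > "$tmpdir/b"
printf 'world\n' > "$tmpdir/c"
check_pair "$tmpdir/a" "$tmpdir/b"
check_pair "$tmpdir/a" "$tmpdir/c"
rm -r "$tmpdir"
```

With the two files from the mail saved as /tmp/msg1 and /tmp/msg2, check_pair /tmp/msg1 /tmp/msg2 should take the first branch, since their digests match while diff reports that the files differ.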
