Grok 3 excels in speed and mathematical ability, but lacks the product ecosystem of competitors such as ChatGPT, Claude or even Le Chat
I have spent a few hours testing Grok 3, the new version of xAI’s AI. I wanted to see its real capabilities and, above all, how it behaves and what kind of results it gives compared to ChatGPT, Claude, Le Chat, DeepSeek…
Reasoning and problem solving
- It excels at maths problems. I had it take on the AIME’24 challenge: it got 6 of the 15 problems right, compared with 9 for OpenAI’s o3-mini-high. Grok 3 was faster, though: it took just under five minutes, while o3-mini-high took almost six. It is very striking to watch it evaluate its own work until it settles on an answer (which was not always the correct one).
A fragment of the steps Grok 3 takes to evaluate its own conclusions before presenting them as the final result.
- In basic reasoning tests, such as determining the number of repeated letters in somewhat complex words (the classic “Lollapalooza”) or comparing decimals (9.11 vs. 9.9), Grok 3 answers correctly after a few seconds of visible “thinking”.
o3-mini-high got it right after 6 seconds. Image: Xataka with ChatGPT.
Grok 3 also got it right, but after more than four times as long. Image: Xataka with Grok 3.
- In a Greek mythology question about Jason’s maternal great-grandfather, Grok 3 found the correct answer in 18 seconds… while o3-mini-high needed 22 seconds to get it wrong. Well played, Grok: its answer was not only correct and faster, but also better constructed.
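The two basic reasoning tests mentioned above (letter counting and decimal comparison) are easy to verify by hand or with a few lines of code. The following is a minimal sketch in Python; the exact prompts given to the chatbots are not reproduced in this article, so the function names and the way each question is framed here are assumptions made for illustration.

```python
from collections import Counter
from decimal import Decimal


def letter_counts(word: str) -> Counter:
    """Count each letter in `word`, ignoring case and non-letters."""
    return Counter(ch for ch in word.lower() if ch.isalpha())


def repeated_letters(word: str) -> dict:
    """Return only the letters that appear more than once."""
    return {ch: n for ch, n in letter_counts(word).items() if n > 1}


def larger_decimal(a: str, b: str) -> str:
    """Return whichever decimal string is numerically larger.

    Decimal compares by value, so "9.9" beats "9.11" even though
    11 > 9 when the fractional digits are read as whole numbers.
    """
    return a if Decimal(a) > Decimal(b) else b


# Hypothetical framing of the two tests described in the article.
print(repeated_letters("Lollapalooza"))  # {'l': 4, 'o': 3, 'a': 3}
print(larger_decimal("9.11", "9.9"))     # 9.9
```

Having a reference answer like this on hand makes it easier to judge whether a chatbot's visible "thinking" ends in the right place.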
Search and synthesis
- Its DeepSearch function is fast, but it is sometimes not entirely accurate and leaves out important details. I asked it to analyze the impact of AI on chip design and, although it generated a 1,504-word text with several citations in just over a minute, it failed to mention important advances such as Google’s AlphaChip framework. It only did so after repeated, insistent follow-up requests.
- I also asked it for a full report on Xataka in terms of finance, media, reputation, etc. It was quite accurate, although it showed a limitation inherent in any Deep Research system: it knows a lot of what is in the public domain, but it offers little real insight, because it lacks the expert judgment of someone who knows not only what is public, but what underlies it. This applies to Grok and to any other Deep Research tool. When you ask about a subject you don’t know well, it’s easy to assume that Deep Research (or in this case, DeepSearch) gives you everything. When you know the subject well, it’s easy to spot the shortcomings, as in this example.
- The speed is impressive: it’s noticeably faster than OpenAI’s Deep Research… but at the cost of depth. Mind you, its selection of sources and citations is usually really good.
- Unlike Gemini, it does not allow you to export reports directly to documents or customize the focus of the research. Again: Grok is very intelligent and capable, in its own way, but it lacks product. A great LLM is of little use if it forces you to start from scratch and process all the information by hand.
Creativity and tone
- To test its creative writing, I asked it for a story about a time traveler facing a paradox. The result was quite solid in character construction, details, descriptions and atmosphere, surpassing even what I consider the best in that respect, Claude 3.5 Sonnet. Mind you, some plot twists seem quite forced.
- Its humor is basic and predictable, limited almost all the time to fairly obvious wordplay. Teenage humor. If the concept of the uncanny valley can be transferred to a chatbot, Grok 3 sits at that 99% mark: too clever to pass for a naive robot, too predictable to be completely convincing.
- It maintains political neutrality even on issues such as immigration or trans rights. Musk says it can be politically incorrect, but this seems to have more to do with what the user asks for than with a personality trait. In other words, it can be made to be politically incorrect, but only when the user pushes it to do so.
Some limitations
- Unlike ChatGPT or Claude, it does not allow you to customize the model’s behavior or response style.
- It is limited to being a text box. It only has buttons to attach a file, activate DeepSearch or activate its reasoning mode. That, and a few basic instructions. No projects like Claude’s, no GPTs like ChatGPT’s, no agents like Le Chat’s. In short: nothing that lets you keep pre-established contexts, guidelines or documentation to make the work easier. We always have to start from a blank canvas.
The interface is good, intuitive, simple… but it lacks the tools that would make it more versatile and desirable for integration into our daily lives. It is powerful and capable for specific uses, but the product built around ChatGPT, Claude or Le Chat (projects, agents, custom instructions, etc.) makes those alternatives much more interesting for serious and recurrent use.
- The safety guardrails are stricter than those of Grok 2. With that version we were amazed by its lack of scruples, but Grok 3 seems to have recovered them: it refused to generate a template for a mass email fraud campaign in which I pretended to be a Valencian prince looking for an heiress.
- Image generation does seem, once again, more lax. Midjourney does not allow you to create anything that contains the words “Donald Trump” or “President of the United States”. Nothing. Grok 3 is not so fussy. Not even with its owner.
You can try Grok 3 from its official website or from its integration in X (which is why you have seen two somewhat different interfaces in this article). It is temporarily free, but we already know that it will be one of the reasons to pay for a subscription to X, and not one of the cheap ones.
Its capacity is undeniable, but we have so many similar alternatives on offer that being a little more intelligent or faster is not what sets it apart. The difference is made by the product, and that is where Grok 3 has more room for improvement.